Escaping From Saddle Points -
Online Stochastic Gradient for Tensor Decomposition

Rong Ge*  Furong Huang†  Chi Jin‡  Yang Yuan§


Abstract 

We analyze stochastic gradient descent for optimizing non-convex functions. In many cases for
non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient
updates are trapped in saddle points. In this paper we identify a strict saddle property for non-convex
problems that allows for efficient optimization. Using this property we show that stochastic gradient
descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge
this is the first work that gives global convergence guarantees for stochastic gradient descent on
non-convex functions with exponentially many local minima and saddle points.

Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning
a rich class of latent variable models. We propose a new optimization formulation for the tensor
decomposition problem that has the strict saddle property. As a result we get the first online algorithm
for orthogonal tensor decomposition with a global convergence guarantee.


1 Introduction 


Stochastic gradient descent is one of the basic algorithms in optimization. It is often used to solve the
following stochastic optimization problem

$$w = \arg\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{x \sim \mathcal{D}}[\phi(w, x)]. \qquad (1)$$

Here $x$ is a data point that comes from some unknown distribution $\mathcal{D}$, and $\phi$ is a loss function that is defined
for a pair $(x, w)$. We hope to minimize the expected loss $\mathbb{E}[\phi(w, x)]$.


When the function $f(w)$ is convex, convergence of stochastic gradient descent is well-understood (Rakhlin et al.,
2012; Shalev-Shwartz et al., 2009). However, stochastic gradient descent is not limited to convex
functions. In particular, in the context of neural networks, stochastic gradient descent is known as the
"backpropagation" algorithm (Rumelhart et al., 1988), and has been the main algorithm that underlies the success
of deep learning (Bengio, 2009). However, the guarantees in the convex setting do not transfer to the
non-convex setting.

Optimizing a non-convex function is NP-hard in general. The difficulty comes from two aspects. First,
a non-convex function may have many local minima, and it might be hard to find the best one (global
minimum) among them. Second, even finding a local minimum might be hard as there can be many
saddle points which have 0-gradient but are not local minima¹. In the most general case, there is no
known algorithm that is guaranteed to find a local minimum in a polynomial number of steps. The discrete
analog (finding a local minimum in domains like $\{0,1\}^n$) has been studied in complexity theory and is
PLS-complete (Johnson et al.).


In many cases, especially in those related to deep neural networks (Dauphin et al., 2014;
Choromanska et al., 2014), the main bottleneck in optimization is not due to local minima, but the existence
of many saddle points. Gradient based algorithms are in particular susceptible to saddle point problems as
they only rely on the gradient information. The saddle point problem is alleviated for second-order methods
that also rely on the Hessian information (Dauphin et al., 2014).

However, using Hessian information usually increases the memory requirement and computation time
per iteration. As a result many applications still use stochastic gradient and empirically get reasonable
results. In this paper we investigate why stochastic gradient methods can be effective even in the presence of
saddle points. In particular, we answer the following question:


Question: Given a non-convex function $f$ with many saddle points, what properties of $f$ will guarantee
that stochastic gradient descent converges to a local minimum efficiently?


We identify a property of non-convex functions which we call strict saddle. Intuitively, this property
guarantees local progress if we have access to the Hessian information. Surprisingly, we show that with only
first-order (gradient) information, stochastic gradient can escape the saddle points efficiently. We give a
framework for analyzing stochastic gradient in both the unconstrained and the equality-constrained case using
this property.

We apply our framework to orthogonal tensor decomposition, which is a core problem in learning many
latent variable models (see discussion in Section 2.2). The tensor decomposition problem is inherently susceptible
to saddle point issues, as the problem asks to find $d$ different components and any permutation of the
true components yields a valid solution. Such symmetry creates exponentially many local minima and
saddle points in the optimization problem. Using our new analysis of stochastic gradient, we give the first
online algorithm for orthogonal tensor decomposition with a global convergence guarantee. This is a key step
towards making tensor decomposition algorithms more scalable.


1.1 Summary of Results 


Strict saddle functions Given a function $f(w)$ that is twice differentiable, we call $w$ a stationary point if
$\nabla f(w) = 0$. A stationary point can either be a local minimum, a local maximum or a saddle point. We
identify an interesting class of non-convex functions which we call strict saddle. For these functions the
Hessian of every saddle point has a negative eigenvalue. In particular, this means that local second-order
algorithms which are similar to the ones in (Dauphin et al., 2014) can always make some progress.

It may seem counter-intuitive why stochastic gradient can work in these cases: in particular if we run 
the basic gradient descent starting from a stationary point then it will not move. However, we show that 
the saddle points are not stable and that the randomness in stochastic gradient helps the algorithm to escape 
from the saddle points. 


Theorem 1 (informal). Suppose $f(w)$ is strict saddle (see Definition 5). Then Noisy Gradient Descent
(Algorithm 1) outputs a point that is close to a local minimum in a polynomial number of steps.

¹ See Section 3 for the definition of saddle points.

Online tensor decomposition Requiring all saddle points to have a negative eigenvalue may seem strong,
but it already allows non-trivial applications to natural non-convex optimization problems. As an example,
we consider the orthogonal tensor decomposition problem. This problem is the key step in spectral learning
for many latent variable models (see more discussions in Section 2.2).

We design a new objective function for tensor decomposition that is strict saddle. 

Theorem 2. Given random samples $X$ such that $T = \mathbb{E}[g(X)] \in \mathbb{R}^{d^4}$ is an orthogonal 4-th order tensor
(see Section 2.2), there is an objective function $f(w) = \mathbb{E}[\phi(w, X)]$, $w \in \mathbb{R}^{d \times d}$, such that every local
minimum of $f(w)$ corresponds to a valid decomposition of $T$. Further, the function $f$ is strict saddle.

Combining this new objective with our framework for analyzing stochastic gradient in the non-convex
setting, we get the first online algorithm for orthogonal tensor decomposition with a global convergence
guarantee.


1.2 Related Works 


Relaxed notions of convexity In optimization theory and economics, there are extensive works on
understanding functions that behave similarly to convex functions (and in particular can be optimized efficiently).
Such notions include pseudo-convexity (Mangasarian, 1965), quasi-convexity (Kiwiel, 2001), invexity
(Hanson, 1999) and their variants. More recently there are also works that consider
classes that admit more efficient optimization procedures like RSC (restricted strong convexity) (Agarwal et al.,
2010). Although these classes involve functions that are non-convex, the function (or at least the function
restricted to the region of analysis) still has a unique stationary point that is the desired local/global minimum.
Therefore these works cannot be used to prove global convergence for problems like tensor decomposition,
where by symmetry of the problem there are multiple local minima and saddle points.


Second-order algorithms The most popular second-order method is Newton's method. Although
Newton's method converges fast near a local minimum, its global convergence properties are less understood
in the more general case. For non-convex functions, (Frieze et al., 1996) gave a concrete example where
a second-order method converges to the desired local minimum in a polynomial number of steps (interestingly
the function of interest is trying to find one component in a 4-th order orthogonal tensor, which is a simpler
case of our application). As Newton's method often also converges to saddle points, different trust-region
algorithms are applied to avoid this behavior (Dauphin et al., 2014).


Stochastic gradient and symmetry The tensor decomposition problem we consider in this paper has
the following symmetry: the solution is a set of $d$ vectors $v_1, \ldots, v_d$. If $(v_1, v_2, \ldots, v_d)$ is a solution, then
for any permutation $\pi$ and any sign flips $\kappa \in \{\pm 1\}^d$, $(\ldots, \kappa_i v_{\pi(i)}, \ldots)$ is also a valid solution. In general,
symmetry is known to generate saddle points, and variants of gradient descent often perform reasonably in
these cases (see (Saad and Solla, 1995), (Rattray et al., 1998), (Inoue et al., 2003)). The settings in these
works are different from ours, and none of them give bounds on the number of steps required for convergence.

There are many other problems that have the same symmetric structure as the tensor decomposition
problem, including the sparse coding problem (Olshausen and Field, 1997) and many deep learning
applications (Bengio, 2009). In these problems the goal is to learn multiple "features" where the solution is
invariant under permutation. Note that there are many recent papers on iterative/gradient based algorithms
for problems related to matrix factorization (Jain et al., 2013; Saxe et al., 2013). These problems often have
very different symmetry, as if $Y = AX$ then for any invertible matrix $R$ we know $Y = (AR)(R^{-1}X)$.
In this case all the equivalent solutions are in a connected low dimensional manifold and there need not be
saddle points between them.

2 Preliminaries


Notation Throughout the paper we use $[d]$ to denote the set $\{1, 2, \ldots, d\}$. We use $\|\cdot\|$ to denote the $\ell_2$ norm
of vectors and the spectral norm of matrices. For a matrix we use $\lambda_{\min}$ to denote its smallest eigenvalue. For a
function $f : \mathbb{R}^d \to \mathbb{R}$, $\nabla f$ and $\nabla^2 f$ denote its gradient vector and Hessian matrix.

2.1 Stochastic Gradient Descent 

The stochastic gradient aims to solve the stochastic optimization problem (1), which we restate here:

$$w = \arg\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{x \sim \mathcal{D}}[\phi(w, x)].$$

Recall $\phi(w, x)$ denotes the loss function evaluated for sample $x$ at point $w$. The algorithm follows a
stochastic gradient

$$w_{t+1} = w_t - \eta \nabla_{w_t} \phi(w_t, x_t), \qquad (2)$$

where $x_t$ is a random sample drawn from distribution $\mathcal{D}$ and $\eta$ is the learning rate.

In the more general setting, stochastic gradient descent can be viewed as optimizing an arbitrary function 
f(w) given a stochastic gradient oracle. 

Definition 3. For a function $f(w) : \mathbb{R}^d \to \mathbb{R}$, a function $SG(w)$ that maps a variable to a random vector in
$\mathbb{R}^d$ is a stochastic gradient oracle if $\mathbb{E}[SG(w)] = \nabla f(w)$ and $\|SG(w) - \nabla f(w)\| \le Q$.

In this case the update step of the algorithm becomes $w_{t+1} = w_t - \eta\, SG(w_t)$.


Smoothness and Strong Convexity Traditional analysis for stochastic gradient often assumes the function
is smooth and strongly convex. A function is $\beta$-smooth if for any two points $w_1, w_2$,

$$\|\nabla f(w_1) - \nabla f(w_2)\| \le \beta \|w_1 - w_2\|. \qquad (3)$$

When $f$ is twice differentiable this is equivalent to assuming that the spectral norm of the Hessian matrix is
bounded by $\beta$. We say a function is $\alpha$-strongly convex if the Hessian at any point has smallest eigenvalue at
least $\alpha$ ($\lambda_{\min}(\nabla^2 f(w)) \ge \alpha$).


Using these two properties, previous work (Rakhlin et al., 2012) shows that stochastic gradient
converges at a rate of $1/t$. In this paper we consider non-convex functions, which can still be $\beta$-smooth but
cannot be strongly convex.


Smoothness of Hessians We also require the Hessian of the function $f$ to be smooth. We say a function
$f(w)$ has $\rho$-Lipschitz Hessian if for any two points $w_1, w_2$ we have

$$\|\nabla^2 f(w_1) - \nabla^2 f(w_2)\| \le \rho \|w_1 - w_2\|. \qquad (4)$$

This is a third order condition that is true if the third order derivative exists and is bounded.

2.2 Tensors decomposition

A $p$-th order tensor is a $p$-dimensional array. In this paper we will mostly consider 4-th order tensors. If
$T \in \mathbb{R}^{d^4}$ is a 4-th order tensor, we use $T_{i_1, i_2, i_3, i_4}$ ($i_1, \ldots, i_4 \in [d]$) to denote its $(i_1, i_2, i_3, i_4)$-th entry.

Tensors can be constructed from tensor products. We use $(u \otimes v)$ to denote a 2nd order tensor where
$(u \otimes v)_{i,j} = u_i v_j$. This generalizes to higher order and we use $u^{\otimes 4}$ to denote the 4-th order tensor

$$[u^{\otimes 4}]_{i_1, i_2, i_3, i_4} = u_{i_1} u_{i_2} u_{i_3} u_{i_4}.$$

We say a 4-th order tensor $T \in \mathbb{R}^{d^4}$ has an orthogonal decomposition if it can be written as

$$T = \sum_{i=1}^{d} a_i^{\otimes 4}, \qquad (5)$$

where the $a_i$'s are orthonormal vectors that satisfy $\|a_i\| = 1$ and $a_i^T a_j = 0$ for $i \ne j$. We call the vectors
$a_i$'s the components of this decomposition. Such a decomposition is unique up to permutation of the $a_i$'s and
sign-flips.

A tensor also defines a multilinear form (just as a matrix defines a bilinear form). For a $p$-th order tensor
$T \in \mathbb{R}^{d^p}$ and matrices $M_i \in \mathbb{R}^{d \times n_i}$, $i \in [p]$, we define

$$[T(M_1, M_2, \ldots, M_p)]_{i_1, i_2, \ldots, i_p} = \sum_{j_1, j_2, \ldots, j_p \in [d]} T_{j_1, j_2, \ldots, j_p} \prod_{t \in [p]} M_t[j_t, i_t].$$

That is, the result of the multilinear form $T(M_1, M_2, \ldots, M_p)$ is another tensor in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_p}$. We will
most often use vectors or identity matrices in the multilinear form. In particular, for a 4-th order tensor
$T \in \mathbb{R}^{d^4}$ we know $T(I, u, u, u)$ is a vector and $T(I, I, u, u)$ is a matrix. In particular, if $T$ has the orthogonal
decomposition in (5), we know $T(I, u, u, u) = \sum_{i=1}^{d} (u^T a_i)^3 a_i$ and $T(I, I, u, u) = \sum_{i=1}^{d} (u^T a_i)^2 a_i a_i^T$.
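The two closed forms above can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper names `multilinear` and `orthogonal_tensor`, the dimension `d = 4`, and the random seed are illustrative assumptions. It builds $T$ from random orthonormal components and compares the contracted tensor with the closed-form expressions.

```python
import numpy as np

def multilinear(T, M1, M2, M3, M4):
    """Multilinear form of a 4-th order tensor: mode t of T is contracted with the rows of M_t."""
    return np.einsum("abcd,ai,bj,ck,dl->ijkl", T, M1, M2, M3, M4)

def orthogonal_tensor(A):
    """Build T = sum_i a_i^{(x)4}, where a_i are the columns of A (hypothetical helper)."""
    return np.einsum("ai,bi,ci,di->abcd", A, A, A, A)

rng = np.random.default_rng(0)
d = 4
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns are orthonormal components
T = orthogonal_tensor(A)

u = rng.standard_normal(d)
I = np.eye(d)
U = u[:, None]                                    # u viewed as a d x 1 matrix

vec = multilinear(T, I, U, U, U)[:, 0, 0, 0]      # T(I, u, u, u), a vector
mat = multilinear(T, I, I, U, U)[:, :, 0, 0]      # T(I, I, u, u), a matrix

c = A.T @ u                                       # c[i] = u^T a_i
vec_closed = A @ c**3                             # sum_i (u^T a_i)^3 a_i
mat_closed = (A * c**2) @ A.T                     # sum_i (u^T a_i)^2 a_i a_i^T
```

Forming $T$ explicitly costs $O(d^4)$ memory, which is why the closed forms (and, later, the stochastic gradient) are preferable in practice.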

Given a tensor $T$ with an orthogonal decomposition, the orthogonal tensor decomposition problem asks
to find the individual components $a_1, \ldots, a_d$. This is a central problem in learning many latent variable
models, including Hidden Markov Model, multi-view models, topic models, mixture of Gaussians and
Independent Component Analysis (ICA). See the discussion and citations in Anandkumar et al. (2014).
The orthogonal tensor decomposition problem can be solved by many algorithms even when the input is a noisy
estimation $\tilde{T} \approx T$ (Harshman, 1970; Kolda, 2001; Anandkumar et al., 2014). In practice this approach
has been successfully applied to ICA (Comon, 2002), topic models (Zou et al., 2013) and community
detection (Huang et al., 2013).


3 Stochastic gradient descent for strict saddle function 

In this section we discuss the properties of saddle points, and show if all the saddle points are well-behaved 
then stochastic gradient descent finds a local minimum for a non-convex function in polynomial time. 


3.1 Strict saddle property 

For a twice differentiable function $f(w)$, we call the points stationary points if their gradients are equal
to 0. Stationary points could be local minima, local maxima or saddle points. By local optimality
conditions (Wright and Nocedal, 1999), in many cases we can tell what type a point $w$ is by looking at its Hessian:
if $\nabla^2 f(w)$ is positive definite then $w$ is a local minimum; if $\nabla^2 f(w)$ is negative definite then $w$ is a local
maximum; if $\nabla^2 f(w)$ has both positive and negative eigenvalues then $w$ is a saddle point. These criteria do
not cover all the cases as there could be degenerate scenarios: $\nabla^2 f(w)$ can be positive semidefinite with an
eigenvalue equal to 0, in which case the point could be a local minimum or a saddle point.

If a function does not have these degenerate cases, then we say the function is strict saddle: 

Definition 4. A twice differentiable function $f(w)$ is strict saddle, if all its local minima have $\nabla^2 f(w) \succ 0$
and all its other stationary points satisfy $\lambda_{\min}(\nabla^2 f(w)) < 0$.

Intuitively, if we are not at a stationary point, then we can always follow the gradient and reduce the
value of the function. If we are at a saddle point, the gradient term vanishes and we need to consider a
second order Taylor expansion:

$$f(w + \Delta w) = f(w) + \tfrac{1}{2} (\Delta w)^T \nabla^2 f(w) (\Delta w) + O(\|\Delta w\|^3).$$

Since the strict saddle property guarantees $\nabla^2 f(w)$ to have a negative eigenvalue, there is always a point
that is near $w$ and has strictly smaller function value. It is possible to make local improvements as long as
we have access to second order information. However it is not clear whether the more efficient stochastic
gradient updates can work in this setting.
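As a small worked illustration (a toy function chosen here for concreteness, not an example from the paper), consider a two-dimensional quadratic with a saddle at the origin. The gradient vanishes there, yet moving along the eigenvector of the negative eigenvalue strictly decreases the function value:

```latex
f(w) = \tfrac{1}{2}\left(w_1^2 - w_2^2\right), \qquad
\nabla f(0) = 0, \qquad
\nabla^2 f(0) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad
\lambda_{\min}\left(\nabla^2 f(0)\right) = -1 < 0 .
% Moving a distance s along e_2, the eigenvector of the negative eigenvalue:
f(0 + s\, e_2) = -\tfrac{1}{2} s^2 < f(0) \quad \text{for every } s \neq 0 .
```

Plain gradient descent started exactly at $w = 0$ never moves, which is the situation the noise in the algorithm below is meant to break.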

To make sure the local improvements are significant, we use a robust version of the strict saddle property: 

Definition 5. A twice differentiable function $f(w)$ is $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, if for any point $w$ at least one
of the following is true

1. $\|\nabla f(w)\| \ge \epsilon$.

2. $\lambda_{\min}(\nabla^2 f(w)) \le -\gamma$.

3. There is a local minimum $w^*$ such that $\|w - w^*\| \le \delta$, and the function $f(w')$ restricted to the $2\delta$
neighborhood of $w^*$ ($\|w' - w^*\| \le 2\delta$) is $\alpha$-strongly convex.

Intuitively, this condition says for any point whose gradient is small, it is either close to a robust local
minimum, or is a saddle point (or local maximum) with a significant negative eigenvalue.


Algorithm 1 Noisy Stochastic Gradient

Require: Stochastic gradient oracle $SG(w)$, initial point $w_0$, desired accuracy $\kappa$.
Ensure: $w_t$ that is close to some local minimum $w^*$.

1: Choose $\eta = \min\{\tilde{O}(\kappa^2/\log(1/\kappa)), \eta_{\max}\}$, $T = \tilde{O}(1/\eta^2)$
2: for $t = 0$ to $T - 1$ do
3:    Sample noise $n$ uniformly from unit sphere.
4:    $w_{t+1} \leftarrow w_t - \eta(SG(w_t) + n)$

We propose a simple variant of the stochastic gradient algorithm, where the only difference from the traditional
algorithm is that we add an extra noise term to the updates. The main benefit of this additional noise is that we
can guarantee there is noise in every direction, which allows the algorithm to effectively explore the local
neighborhood around saddle points. If the noise from the stochastic gradient oracle already has non-negligible
variance in every direction, our analysis also applies without adding additional noise. We show noise can
help the algorithm escape from saddle points and optimize strict saddle functions.
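The update in Algorithm 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the toy function $f(w) = \tfrac{1}{2} w_1^2 + \tfrac{1}{4} w_2^4 - \tfrac{1}{2} w_2^2$ (saddle at the origin, local minima at $(0, \pm 1)$), the exact-gradient oracle, and the fixed values of `eta` and `num_steps` are all choices made here, not the theoretical settings of $\eta$ and $T$.

```python
import numpy as np

def grad_f(w):
    """Exact gradient of the toy function f(w) = 0.5*w[0]**2 + 0.25*w[1]**4 - 0.5*w[1]**2.

    It plays the role of the oracle SG(w) with zero oracle noise (an assumption of this sketch).
    """
    return np.array([w[0], w[1] ** 3 - w[1]])

def unit_sphere_noise(rng, dim):
    """Sample a vector uniformly from the unit sphere by normalizing a Gaussian."""
    n = rng.standard_normal(dim)
    return n / np.linalg.norm(n)

def noisy_sgd(sg, w0, eta, num_steps, rng, add_noise=True):
    """Iterate w_{t+1} = w_t - eta * (SG(w_t) + n), with n uniform on the unit sphere."""
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        n = unit_sphere_noise(rng, w.size) if add_noise else 0.0
        w = w - eta * (sg(w) + n)
    return w

rng = np.random.default_rng(0)
saddle = [0.0, 0.0]  # stationary point with Hessian diag(1, -1)

w_plain = noisy_sgd(grad_f, saddle, eta=0.01, num_steps=5000, rng=rng, add_noise=False)
w_noisy = noisy_sgd(grad_f, saddle, eta=0.01, num_steps=5000, rng=rng, add_noise=True)
```

Without the added noise the iterate stays at the saddle point, while the noisy iterate drifts along the negative-curvature direction and settles near one of the two local minima.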
Theorem 6 (Main Theorem). Suppose a function $f(w) : \mathbb{R}^d \to \mathbb{R}$ is $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, and has a
stochastic gradient oracle with radius at most $Q$. Further, suppose the function is bounded by $|f(w)| \le B$,
is $\beta$-smooth and has $\rho$-Lipschitz Hessian. Then there exists a threshold $\eta_{\max} = \tilde{\Theta}(1)$, so that for any
$\zeta > 0$, and for any $\eta \le \eta_{\max} / \max\{1, \log(1/\zeta)\}$, with probability at least $1 - \zeta$ in $t = \tilde{O}(\eta^{-2} \log(1/\zeta))$
iterations, Algorithm 1 (Noisy Gradient Descent) outputs a point $w_t$ that is $\tilde{O}(\sqrt{\eta \log(1/\eta\zeta)})$-close to some
local minimum $w^*$.

Here (and throughout the rest of the paper) $\tilde{O}(\cdot)$ ($\tilde{\Omega}$, $\tilde{\Theta}$) hides the factor that is polynomially dependent
on all other parameters (including $Q$, $1/\alpha$, $1/\gamma$, $1/\epsilon$, $1/\delta$, $B$, $\beta$, $\rho$, and $d$), but independent of $\eta$ and $\zeta$.
So it focuses on the dependency on $\eta$ and $\zeta$. Our proof technique can give explicit dependencies on these
parameters; however, we hide these dependencies for simplicity of presentation.

Remark (Decreasing learning rate). Often analysis of stochastic gradient descent uses decreasing learning
rates and the algorithm converges to a local (or global) minimum. Since the function is strongly convex in
the small region close to a local minimum, we can use Theorem 6 to first find a point that is close to a local
minimum, and then apply standard analysis of SGD in the strongly convex case (where we decrease the
learning rate by $1/t$ and get $1/\sqrt{t}$ convergence in $\|w - w^*\|$).

In the next part we sketch the proof of the main theorem. Details are deferred to Appendix A.


3.2 Proof sketch 


In order to prove Theorem 6 we analyze the three cases in Definition 5. When the gradient is large, we
show the function value decreases in one step (see Lemma 7); when the point is close to a local minimum,
we show with high probability it cannot escape in the next polynomial number of iterations (see Lemma 8).

Lemma 7 (Gradient). Under the assumptions of Theorem 6, for any point with $\|\nabla f(w_t)\| \ge C\sqrt{\eta}$ (where
$C = \tilde{\Theta}(1)$) and $C\sqrt{\eta} \le \epsilon$, after one iteration we have $\mathbb{E}[f(w_{t+1})] \le f(w_t) - \tilde{\Omega}(\eta^2)$.


The proof of this lemma is a simple application of the smoothness property. 


Lemma 8 (Local minimum). Under the assumptions of Theorem 6, for any point $w_t$ that is $\tilde{O}(\sqrt{\eta}) < \delta$
close to a local minimum $w^*$, in $\tilde{O}(\eta^{-2} \log(1/\zeta))$ number of steps all future $w_{t+i}$'s are $\tilde{O}(\sqrt{\eta \log(1/\eta\zeta)})$-close
with probability at least $1 - \zeta/2$.


The proof of this lemma is similar to the standard analysis (Rakhlin et al., 2012) of stochastic gradient
descent in the smooth and strongly convex setting, except we only have local strong convexity. The proof
appears in Appendix A.

The hardest case is when the point is "close" to a saddle point: it has gradient smaller than $\epsilon$ and
smallest eigenvalue of the Hessian bounded by $-\gamma$. In this case we show the noise in our algorithm helps
the algorithm to escape:


Lemma 9 (Saddle point). Under the assumptions of Theorem 6, for any point $w_t$ where $\|\nabla f(w_t)\| \le C\sqrt{\eta}$
(for the same $C$ as in Lemma 7), and $\lambda_{\min}(\nabla^2 f(w_t)) \le -\gamma$, there is a number of steps $T$ that depends on
$w_t$ such that $\mathbb{E}[f(w_{t+T})] \le f(w_t) - \tilde{\Omega}(\eta)$. The number of steps $T$ has a fixed upper bound $T_{\max}$ that is
independent of $w_t$ where $T \le T_{\max} = \tilde{O}(1/\eta)$.


Intuitively, at point $w_t$ there is a good direction that is hiding in the Hessian. The hope of the algorithm
is that the additional (or inherent) noise in the update step makes a small step towards the correct direction,
and then the gradient information will reinforce this small perturbation and the future updates will "slide"
down the correct direction.

To make this more formal, we consider a coupled sequence of updates $\tilde{w}$ such that the function to
minimize is just the local second order approximation

$$\tilde{f}(w) = f(w_t) + \nabla f(w_t)^T (w - w_t) + \tfrac{1}{2} (w - w_t)^T \nabla^2 f(w_t) (w - w_t).$$

The dynamics of stochastic gradient descent for this quadratic function is easy to analyze as $\tilde{w}_{t+i}$ can be
calculated analytically. Indeed, we show the expectation of $\tilde{f}(\tilde{w})$ will decrease. We then use the smoothness
of the function to show that as long as the points did not go very far from $w_t$, the two update sequences $\tilde{w}$
and $w$ will remain close to each other, and thus $f(w_{t+i}) \approx \tilde{f}(\tilde{w}_{t+i})$. Finally we prove the future $w_{t+i}$'s (in
the next $T$ steps) will remain close to $w_t$ with high probability by martingale bounds. The detailed proof
appears in Appendix A.
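To see why the quadratic dynamics can be computed in closed form, the recursion can be unrolled as follows. This is a sketch in notation introduced here for illustration (not the paper's): $g = \nabla f(w_t)$, $H = \nabla^2 f(w_t)$, $\Delta_i = \tilde{w}_{t+i} - w_t$, and $\xi_i$ denotes the total zero-mean noise at step $i$ (oracle noise plus the added noise).

```latex
\Delta_{i+1} = \Delta_i - \eta\,(g + H\Delta_i + \xi_i)
             = (I - \eta H)\,\Delta_i - \eta\,(g + \xi_i), \qquad \Delta_0 = 0,
\\
\Delta_i = -\eta \sum_{\tau=0}^{i-1} (I - \eta H)^{\,i-1-\tau}\,(g + \xi_\tau).
\\
% Along a unit eigenvector e of H with eigenvalue -\gamma_0 \le -\gamma:
e^\top \Delta_i = -\eta \sum_{\tau=0}^{i-1} (1 + \eta\gamma_0)^{\,i-1-\tau}\; e^\top (g + \xi_\tau).
```

The factor $(1 + \eta\gamma_0)^{i-1-\tau}$ grows geometrically, so any noise component along a negative-curvature direction is amplified over time, which matches the "sliding" intuition described above.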

With these three lemmas it is easy to prove the main theorem. Intuitively, as long as the probability
of being $\tilde{O}(\sqrt{\eta})$-close to a local minimum is still small, we can always apply Lemma 7 or Lemma 9 to make
the expected function value decrease by $\tilde{\Omega}(\eta)$ in at most $\tilde{O}(1/\eta)$ iterations. This cannot go on for more
than $\tilde{O}(1/\eta^2)$ iterations because in that case the expected function value will decrease by more than $2B$,
but $\max f(x) - \min f(x) \le 2B$ by our assumption. Therefore in $\tilde{O}(1/\eta^2)$ steps with at least constant
probability $w_t$ will become $\tilde{O}(\sqrt{\eta})$-close to a local minimum. By Lemma 8 we know once it is close it
will almost always stay close, so we can repeat this $\log(1/\zeta)$ times to get the high probability result. More
details appear in Appendix A.

3.3 Constrained Problems 

In many cases, the problems we face are constrained optimization problems. In this part we briefly
describe how to adapt the analysis to problems with equality constraints (which suffices for the tensor
application). Dealing with general inequality constraints is left as future work.

For a constrained optimization problem:

$$\min_{w \in \mathbb{R}^d} f(w) \quad \text{s.t. } c_i(w) = 0, \; i \in [m] \qquad (6)$$

in general we need to consider the set of points in a low dimensional manifold that is defined by the
constraints. In particular, in the algorithm after every step we need to project back to this manifold (see
Algorithm 2 where $\Pi_{\mathcal{W}}$ is the projection to this manifold).


Algorithm 2 Projected Noisy Stochastic Gradient

Require: Stochastic gradient oracle $SG(w)$, initial point $w_0$, desired accuracy $\kappa$.
Ensure: $w_t$ that is close to some local minimum $w^*$.

1: Choose $\eta = \min\{\tilde{O}(\kappa^2/\log(1/\kappa)), \eta_{\max}\}$, $T = \tilde{O}(1/\eta^2)$
2: for $t = 0$ to $T - 1$ do
3:    Sample noise $n$ uniformly from unit sphere.
4:    $v_{t+1} \leftarrow w_t - \eta(SG(w_t) + n)$
5:    $w_{t+1} = \Pi_{\mathcal{W}}(v_{t+1})$






For constrained optimization it is common to consider the Lagrangian:

$$\mathcal{L}(w, \lambda) = f(w) - \sum_{i=1}^{m} \lambda_i c_i(w). \qquad (7)$$

Under common regularity conditions, it is possible to compute the value of the Lagrangian multipliers:

$$\lambda^*(w) = \arg\min_{\lambda} \|\nabla_w \mathcal{L}(w, \lambda)\|.$$

We can also define the tangent space, which contains all directions that are orthogonal to all the gradients
of the constraints: $\mathcal{T}(w) = \{v : \nabla c_i(w)^T v = 0; \; i = 1, \ldots, m\}$. In this case the corresponding gradient
and Hessian we consider are the first-order and second-order partial derivatives of the Lagrangian $\mathcal{L}$ at point
$(w, \lambda^*(w))$:

$$\chi(w) = \nabla_w \mathcal{L}(w, \lambda)|_{(w, \lambda^*(w))} = \nabla f(w) - \sum_{i=1}^{m} \lambda_i^*(w) \nabla c_i(w) \qquad (8)$$

$$\mathfrak{M}(w) = \nabla^2_{ww} \mathcal{L}(w, \lambda)|_{(w, \lambda^*(w))} = \nabla^2 f(w) - \sum_{i=1}^{m} \lambda_i^*(w) \nabla^2 c_i(w) \qquad (9)$$

We replace the gradient and Hessian with $\chi(w)$ and $\mathfrak{M}(w)$, and when computing eigenvectors of $\mathfrak{M}(w)$
we focus on its projection on the tangent space. In this way, we can get a similar definition for strict
saddle (see Appendix B), and the following theorem.

Theorem 10. (informal) Under regularity conditions and smoothness conditions, if a constrained
optimization problem satisfies the strict saddle property, then for a small enough $\eta$, in $\tilde{O}(\eta^{-2} \log 1/\zeta)$ iterations
Projected Noisy Gradient Descent (Algorithm 2) outputs a point $w$ that is $\tilde{O}(\sqrt{\eta \log(1/\eta\zeta)})$ close to a local
minimum with probability at least $1 - \zeta$.

Detailed discussions and the formal version of this theorem are deferred to Appendix B.


4 Online Tensor Decomposition 


In this section we describe how to apply our stochastic gradient descent analysis to tensor decomposition
problems. We first give a new formulation of tensor decomposition as an optimization problem, and show
that it satisfies the strict saddle property. Then we explain how to compute the stochastic gradient in a simple
example of Independent Component Analysis (ICA) (Hyvarinen et al., 2004).


4.1 Optimization problem for tensor decomposition 

Given a tensor $T \in \mathbb{R}^{d^4}$ that has an orthogonal decomposition

$$T = \sum_{i=1}^{d} a_i^{\otimes 4}, \qquad (10)$$

where the components $a_i$'s are orthonormal vectors ($\|a_i\| = 1$, $a_i^T a_j = 0$ for $i \ne j$), the goal of orthogonal
tensor decomposition is to find the components $a_i$'s.

This problem has inherent symmetry: for any permutation $\pi$ and any set of $\kappa_i \in \{\pm 1\}$, $i \in [d]$, we know
$u_i = \kappa_i a_{\pi(i)}$ is also a valid solution. This symmetry property makes the natural optimization problems
non-convex.

In this section we will give a new formulation of orthogonal tensor decomposition as an optimization
problem, and show that this new problem satisfies the strict saddle property.

Previously, Frieze et al. (1996) solved the problem of finding one component, with the following
objective function

$$\max_{\|u\|^2 = 1} T(u, u, u, u). \qquad (11)$$

In Appendix C.1, as a warm-up example, we show this function is indeed strict saddle, and we can apply
Theorem 10 to prove global convergence of the stochastic gradient descent algorithm.
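A minimal sketch of Projected Noisy Gradient Descent (Algorithm 2) on objective (11) is shown below, written as minimizing $-T(u,u,u,u)$ over the unit sphere. Several things here are assumptions made for illustration and not taken from the paper: the exact gradient $-4\,T(I,u,u,u)$ is used in place of a sample-based oracle, the projection is plain normalization, and `d = 5`, `eta = 0.01`, and `num_steps = 5000` are arbitrary choices.

```python
import numpy as np

def project_to_sphere(v):
    """Projection onto the constraint set {u : ||u|| = 1}."""
    return v / np.linalg.norm(v)

def grad_neg_T4(A, u):
    """Gradient of f(u) = -T(u,u,u,u) = -sum_i (u^T a_i)^4, computed without forming T."""
    c = A.T @ u                   # c[i] = u^T a_i
    return -4.0 * (A @ c ** 3)    # equals -4 * T(I, u, u, u)

def projected_noisy_gd(A, u0, eta, num_steps, rng):
    """v_{t+1} = u_t - eta * (gradient + n), then u_{t+1} = projection of v_{t+1}."""
    u = project_to_sphere(np.asarray(u0, dtype=float))
    for _ in range(num_steps):
        n = project_to_sphere(rng.standard_normal(u.size))  # uniform on the unit sphere
        v = u - eta * (grad_neg_T4(A, u) + n)
        u = project_to_sphere(v)
    return u

rng = np.random.default_rng(0)
d = 5
A, _ = np.linalg.qr(rng.standard_normal((d, d)))   # hidden orthonormal components a_i
u_hat = projected_noisy_gd(A, rng.standard_normal(d), eta=0.01, num_steps=5000, rng=rng)

overlaps = np.abs(A.T @ u_hat)   # |<u_hat, a_i>| for each component
```

In this sketch the largest entry of `overlaps` ends up close to 1, meaning the iterate aligns with one component up to sign; which component is found depends on the initialization and the noise.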

It is possible to find all components of a tensor by iteratively finding one component and doing careful
deflation, as described in Anandkumar et al. (2014) or Arora et al. (2012). However, in practice the most
popular approaches like Alternating Least Squares (Comon et al., 2009) or FastICA (Hyvarinen, 1999) try
to use a single optimization problem to find all the components. Empirically these algorithms are often more
robust to noise and model misspecification.

The most straightforward formulation of the problem aims to minimize the reconstruction error

$$\min_{\forall i, \|u_i\|^2 = 1} \Big\| T - \sum_{i=1}^{d} u_i^{\otimes 4} \Big\|_F^2. \qquad (12)$$

Here $\|\cdot\|_F$ is the Frobenius norm of the tensor which is equal to the $\ell_2$ norm when we view the tensor as a
$d^4$ dimensional vector. However, it is not clear whether this function satisfies the strict saddle property, and
empirically stochastic gradient descent is unstable for this objective.

We propose a new objective that aims to minimize the correlation between different components:

$$\min_{\forall i, \|u_i\|^2 = 1} \sum_{i \ne j} T(u_i, u_i, u_j, u_j). \qquad (13)$$

To understand this objective intuitively, we first expand vectors $u_k$ in the orthogonal basis formed by the $\{a_i\}$'s.
That is, we can write $u_k = \sum_{i=1}^{d} z_k(i) a_i$, where $z_k(i)$ are scalars that correspond to the coordinates in the
$\{a_i\}$ basis. In this way we can rewrite $T(u_k, u_k, u_l, u_l) = \sum_{i=1}^{d} (z_k(i))^2 (z_l(i))^2$. From this form it is clear
that $T(u_k, u_k, u_l, u_l)$ is always nonnegative, and is equal to 0 only when the supports of $z_k$ and $z_l$ do
not intersect. For the objective function, we know in order for it to be equal to 0 the $z$'s must have disjoint
support. Therefore, we claim that $\{u_k\}$, $\forall k \in [d]$ is equivalent to $\{a_i\}$, $\forall i \in [d]$ up to permutation and sign
flips when the global minimum (which is 0) is achieved.
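The rewriting of $T(u_k, u_k, u_l, u_l)$ in the $\{a_i\}$ basis can be checked numerically. The sketch below is illustrative only: the helper name `objective`, the dimension `d = 4`, and the seed are assumptions. It evaluates objective (13) through the coordinates $z_k(i)$, compares it with an explicit tensor contraction, and confirms that a permuted, sign-flipped copy of the true components attains the value 0.

```python
import numpy as np

def objective(A, U):
    """Objective (13), sum_{k != l} T(u_k,u_k,u_l,u_l), for T = sum_i a_i^{(x)4}.

    Columns of A are the components a_i; columns of U are the candidates u_k.
    """
    Z2 = (A.T @ U) ** 2           # Z2[i, k] = z_k(i)^2
    G = Z2.T @ Z2                 # G[k, l] = sum_i z_k(i)^2 z_l(i)^2
    return G.sum() - np.trace(G)  # drop the k == l terms

rng = np.random.default_rng(0)
d = 4
A, _ = np.linalg.qr(rng.standard_normal((d, d)))
T = np.einsum("ai,bi,ci,di->abcd", A, A, A, A)

# A random orthonormal candidate: compare against explicit contraction with T.
U_rand, _ = np.linalg.qr(rng.standard_normal((d, d)))
explicit = sum(
    np.einsum("abcd,a,b,c,d->", T, U_rand[:, k], U_rand[:, k], U_rand[:, l], U_rand[:, l])
    for k in range(d) for l in range(d) if k != l
)

# A valid solution: the true components, permuted and sign-flipped.
perm = rng.permutation(d)
flips = rng.choice([-1.0, 1.0], size=d)
U_sol = A[:, perm] * flips
```

Evaluating the objective through `Z2` needs only matrix products, whereas the explicit contraction requires the full $d^4$ tensor.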

We further show that this optimization program satisfies the strict saddle property and all its local minima
in fact achieve the global minimum value. The proof is deferred to Appendix C.2.

Theorem 11. The optimization problem (13) is $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, for $\alpha = 1$ and $\gamma, \epsilon, \delta = 1/\mathrm{poly}(d)$.
Moreover, all its local minima have the form $u_i = \kappa_i a_{\pi(i)}$ for some $\kappa_i = \pm 1$ and permutation $\pi(i)$.


4.2 Implementing stochastic gradient oracle 

To design an online algorithm based on objective function (13), we need to give an implementation for the
stochastic gradient oracle.

In applications, the tensor $T$ is oftentimes the expectation of multilinear operations of samples $g(x)$ over
$x$ where $x$ is generated from some distribution $\mathcal{D}$. In other words, for $x \sim \mathcal{D}$, the tensor is $T = \mathbb{E}[g(x)]$.
Using the linearity of the multilinear map, we know $\mathbb{E}[g(x)](u_i, u_i, u_j, u_j) = \mathbb{E}[g(x)(u_i, u_i, u_j, u_j)]$.
Therefore we can define the loss function $\phi(u, x) = \sum_{i \ne j} g(x)(u_i, u_i, u_j, u_j)$, and the stochastic gradient oracle
$SG(u) = \nabla_u \phi(u, x)$.

For concreteness, we look at a simple ICA example. In the simple setting we consider an unknown
signal $x$ that is uniform² in $\{\pm 1\}^d$, and an unknown orthonormal linear transformation³ $A$ ($AA^T = I$). The
sample we observe is $y := Ax \in \mathbb{R}^d$. Using standard techniques (see Cardoso (1989)), we know the 4-th
order cumulant of the observed sample is a tensor that has an orthogonal decomposition. Here for simplicity
we don't define the 4-th order cumulant; instead we give the result directly.


Define tensor $Z \in \mathbb{R}^{d^4}$ as follows:

$$Z(i, i, i, i) = 3, \quad \forall i \in [d]$$
$$Z(i, i, j, j) = Z(i, j, i, j) = Z(i, j, j, i) = 1, \quad \forall i \ne j \in [d]$$

where all other entries of $Z$ are equal to 0. The tensor $T$ can be written as a function of the auxiliary tensor
$Z$ and the multilinear form of the sample $y$.

Lemma 12. The expectation $\mathbb{E}[\tfrac{1}{2}(Z - y^{\otimes 4})] = \sum_{i=1}^{d} a_i^{\otimes 4} = T$, where the $a_i$'s are columns of the unknown
orthonormal matrix $A$.


This lemma is easy to verify, and is closely related to cumulants (Cardoso, 1989). Recall that $\phi(u, y)$ denotes the loss (objective) function evaluated at sample $y$ for point $u$. Let $\phi(u, y) = \sum_{i \ne j} \frac{1}{2}(Z - y^{\otimes 4})(u_i, u_i, u_j, u_j)$. By Lemma 12, we know that $\mathbb{E}[\phi(u, y)]$ is equal to the objective function as in Equation (13). Therefore we rewrite objective (13) as the following stochastic optimization problem

$$\min_{\forall i, \|u_i\|^2 = 1} \mathbb{E}[\phi(u, y)], \quad \text{where } \phi(u, y) = \sum_{i \ne j} \frac{1}{2}(Z - y^{\otimes 4})(u_i, u_i, u_j, u_j)$$


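The stochastic gradient oracle is then

$$\nabla_{u_i} \phi(u, y) = \sum_{j \ne i} \left( \langle u_j, u_j \rangle u_i + 2 \langle u_i, u_j \rangle u_j - \langle u_j, y \rangle^2 \langle u_i, y \rangle y \right). \tag{14}$$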

Notice that computing this stochastic gradient does not require constructing the 4-th order tensor $Z - y^{\otimes 4}$. In particular, this stochastic gradient can be computed very efficiently:

Remark. The stochastic gradient (14) can be computed in $O(d^3)$ time for one sample or $O(d^3 + d^2 k)$ for the average of $k$ samples.

Proof. The proof is straightforward, as the first two terms take $O(d^3)$ time and are shared by all samples. The third term can be efficiently computed once the inner products between all the $y$'s and all the $u_i$'s are computed (which takes $O(kd^2)$ time). □
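As a rough illustration of the remark above, the following Python sketch evaluates Eq. (14) for one sample using only inner products, so no tensor with $d^4$ entries is ever formed. The function names `dot` and `stochastic_gradient` and the list-of-lists data layout are assumptions made for this example; they do not come from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def stochastic_gradient(u, y):
    """Evaluate Eq. (14) for every component u_i, given one sample y.

    u is a list of vectors (each a list of floats); y is one sample.
    Returns one gradient vector per u_i.
    """
    d = len(u)
    dim = len(y)
    # Inner products <u_i, u_j>: depend on u only, shared by all samples.
    gram = [[dot(u[i], u[j]) for j in range(d)] for i in range(d)]
    # Inner products <u_i, y>: the only per-sample quantities needed.
    uy = [dot(ui, y) for ui in u]
    total_sq = sum(c * c for c in uy)  # sum over all j of <u_j, y>^2
    grads = []
    for i in range(d):
        # First two terms of Eq. (14).
        g = [0.0] * dim
        for j in range(d):
            if j != i:
                for k in range(dim):
                    g[k] += gram[j][j] * u[i][k] + 2.0 * gram[i][j] * u[j][k]
        # Third term: sum_{j != i} <u_j, y>^2 <u_i, y> y.
        coeff = (total_sq - uy[i] ** 2) * uy[i]
        grads.append([g[k] - coeff * y[k] for k in range(dim)])
    return grads
```

In this sketch the part built from `gram` could be computed once and reused across a mini-batch, while the part built from `uy` costs $O(d^2)$ per sample, which is consistent with the $O(d^3 + d^2 k)$ count in the remark.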

2 In general ICA the entries of $x$ are independent, non-Gaussian variables.

3 In general (under-complete) ICA this could be an arbitrary linear transformation, however usually after the “whitening” step (see Cardoso (1989)) the linear transformation becomes orthonormal.

5 Experiments


We run simulations for Projected Noisy Gradient Descent (Algorithm 2) applied to orthogonal tensor decomposition. The results show that the algorithm converges from random initial points efficiently (as predicted by the theorems), and that our new formulation (13) performs better than the formulation based on reconstruction error (12).

Settings We set dimension $d = 10$; the input tensor $T$ is a random tensor in $\mathbb{R}^{d^4}$ that has an orthogonal decomposition. The step size is chosen carefully for the respective objective functions. The performance is measured by the normalized reconstruction error $\mathcal{E} = \left( \|T - \sum_{i=1}^d u_i^{\otimes 4}\|_F^2 \right) / \|T\|_F^2$.

Samples and stochastic gradients We use two ways to generate samples and compute stochastic gradients. In the first case we generate sample $x$ by setting it equal to $d^{1/4} a_i$ with probability $1/d$. It is easy to see that $\mathbb{E}[x^{\otimes 4}] = T$. This is a very simple way of generating samples, and we use it as a sanity check for the objective functions.

In the second case we consider the ICA example introduced in Section 4.2, and use Equation (14) to compute a stochastic gradient. In this case the stochastic gradient has a large variance, so we use a mini-batch of size 100 to reduce the variance.
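The identity $\mathbb{E}[x^{\otimes 4}] = T$ for the first sampling scheme can be checked entry by entry on a small instance. The sketch below is only an illustration under assumed names (`outer4`, `target_tensor`, `draw_sample`, `expected_sample_tensor`) and an assumed dictionary representation of the tensor; it computes the expectation exactly by averaging over the $d$ equally likely outcomes.

```python
import itertools
import random


def outer4(v):
    """Entries of the rank-one tensor v^{(x)4}, keyed by index tuple."""
    d = len(v)
    return {
        idx: v[idx[0]] * v[idx[1]] * v[idx[2]] * v[idx[3]]
        for idx in itertools.product(range(d), repeat=4)
    }


def target_tensor(components):
    """T = sum_i a_i^{(x)4} for the given list of vectors a_i."""
    dim = len(components[0])
    total = {idx: 0.0 for idx in itertools.product(range(dim), repeat=4)}
    for a in components:
        for idx, value in outer4(a).items():
            total[idx] += value
    return total


def draw_sample(components, rng):
    """Pick one a_i uniformly at random and return d^{1/4} a_i."""
    scale = len(components) ** 0.25
    a = rng.choice(components)
    return [scale * x for x in a]


def expected_sample_tensor(components):
    """Exact E[x^{(x)4}]: average over the d equally likely samples."""
    d = len(components)
    scale = d ** 0.25
    dim = len(components[0])
    total = {idx: 0.0 for idx in itertools.product(range(dim), repeat=4)}
    for a in components:
        for idx, value in outer4([scale * x for x in a]).items():
            total[idx] += value / d
    return total
```

The factor $d^{1/4}$ is what cancels the probability $1/d$: each outcome contributes $(d^{1/4})^4 / d = 1$ times $a_i^{\otimes 4}$.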

Comparison of objective functions We use the simple way of generating samples for our new objective function (13) and the reconstruction error objective (12). The result is shown in Figure 1. Our new objective function is empirically more stable (it always converges within 10000 iterations); the reconstruction error does not always converge within the same number of iterations and often exhibits long periods with small improvement (which is likely to be caused by saddle points that do not have a significant negative eigenvalue).

Simple ICA example As shown in Figure 2, our new algorithm also works in the ICA setting. When the learning rate is constant the error stays at a fixed small value. When we decrease the learning rate the error converges to 0.




Figure 1: Comparison of different objective functions. (a) New Objective (13). (b) Reconstruction Error Objective (12).

Figure 2: ICA setting performance with mini-batch of size 100. (a) Constant Learning Rate $\eta$. (b) Learning Rate $\eta/t$ (in log scale).


6 Conclusion 

In this paper we identify the strict saddle property and show that stochastic gradient descent converges to a local minimum under this assumption. This leads to a new online algorithm for orthogonal tensor decomposition. We hope this is a first step towards understanding stochastic gradient for more classes of non-convex functions. We believe the strict saddle property can be extended to handle more functions, especially those functions that have similar symmetry properties.

A Detailed Analysis for Section 3 in Unconstrained Case


In this section we give a detailed analysis for noisy gradient descent, under the assumption that the unconstrained problem satisfies the $(\alpha, \gamma, \epsilon, \delta)$-strict saddle property.

The algorithm we investigate is Algorithm 1. We can combine the randomness in the stochastic gradient oracle and the artificial noise, and rewrite the update equation in the form:

$$w_t = w_{t-1} - \eta(\nabla f(w_{t-1}) + \xi_{t-1}) \tag{15}$$

where $\eta$ is the step size and $\xi = SG(w_{t-1}) - \nabla f(w_{t-1}) + n$ (recall $n$ is a random vector on the unit sphere) is the combination of two sources of noise.

By assumption, we know the $\xi$'s are independent and satisfy $\mathbb{E}\xi = 0$, $\|\xi\| \le Q + 1$. Due to the explicitly added noise in Algorithm 1, we further have $\mathbb{E}\xi\xi^T \succeq \frac{1}{d} I$. For simplicity, we assume $\mathbb{E}\xi\xi^T = \sigma^2 I$, for some constant $\sigma = \tilde{\Theta}(1)$; then the algorithm we are running is exactly the same as Stochastic Gradient Descent (SGD). Our proof can be very easily extended to the case when $\frac{1}{d} I \preceq \mathbb{E}[\xi\xi^T] \preceq (Q + \frac{1}{d}) I$ because both the upper and lower bounds are $\tilde{\Theta}(1)$.
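To make the update rule (15) concrete, here is a minimal sketch of noisy gradient descent on a toy two-dimensional function. The function $f(w) = (w_1^2 - 1)^2/4 + w_2^2/2$ is an assumption chosen for illustration only (it is not from the paper): it has a saddle at the origin with Hessian $\mathrm{diag}(-1, 1)$ and local minima at $(\pm 1, 0)$. The gradient oracle here is exact, so the only noise is the added unit-sphere vector $n$; all function names are hypothetical.

```python
import math
import random


def grad_f(w):
    """Gradient of the toy function f(w) = (w1^2 - 1)^2 / 4 + w2^2 / 2."""
    return [w[0] ** 3 - w[0], w[1]]


def unit_sphere_noise(rng, dim):
    """Random direction on the unit sphere (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def noisy_gradient_descent(w0, eta, steps, rng, add_noise=True):
    """Iterate w_t = w_{t-1} - eta * (grad f(w_{t-1}) + xi_{t-1}), as in Eq. (15)."""
    w = list(w0)
    for _ in range(steps):
        g = grad_f(w)
        xi = unit_sphere_noise(rng, len(w)) if add_noise else [0.0] * len(w)
        w = [wi - eta * (gi + xii) for wi, gi, xii in zip(w, g, xi)]
    return w
```

Started exactly at the saddle point, the noiseless iteration never moves because the gradient is zero there, while the noisy iteration is pushed along the negative-curvature direction and then settles near one of the two minima.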

We first restate the main theorem in the context of stochastic gradient descent.

Theorem 13 (Main Theorem). Suppose a function $f(w): \mathbb{R}^d \to \mathbb{R}$ is $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, and has a stochastic gradient oracle where the noise satisfies $\mathbb{E}\xi\xi^T = \sigma^2 I$. Further, suppose the function is bounded by $|f(w)| \le B$, is $\beta$-smooth and has $\rho$-Lipschitz Hessian. Then there exists a threshold $\eta_{\max} = \tilde{\Theta}(1)$, so that for any $\zeta > 0$, and for any $\eta \le \eta_{\max} / \max\{1, \log(1/\zeta)\}$, with probability at least $1 - \zeta$ in $t = \tilde{O}(\eta^{-2} \log(1/\zeta))$ iterations, SGD outputs a point $w_t$ that is $\tilde{O}(\sqrt{\eta \log(1/\eta\zeta)})$-close to some local minimum $w^\star$.

Recall that $\tilde{O}(\cdot)$ ($\tilde{\Omega}$, $\tilde{\Theta}$) hides the factor that is polynomially dependent on all other parameters, but independent of $\eta$ and $\zeta$. So it focuses on the dependency on $\eta$ and $\zeta$. Throughout the proof, we interchangeably use both $\mathcal{H}(w)$ and $\nabla^2 f(w)$ to represent the Hessian matrix of $f(w)$.

As we discussed in the proof sketch in Section 3, we analyze the behavior of the algorithm in three different cases. The first case is when the gradient is large.

Lemma 14. Under the assumptions of Theorem 13, for any point with $\|\nabla f(w_0)\| \ge \sqrt{2\eta\sigma^2\beta d}$ where $\sqrt{2\eta\sigma^2\beta d} < \epsilon$, after one iteration we have:

$$\mathbb{E} f(w_1) - f(w_0) \le -\tilde{\Omega}(\eta^2) \tag{16}$$

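Proof. Choose $\eta_{\max} < \frac{1}{\beta}$; then by the update equation Eq.(15), we have:

$$\begin{aligned}
\mathbb{E} f(w_1) - f(w_0) &\le \nabla f(w_0)^T \mathbb{E}(w_1 - w_0) + \frac{\beta}{2} \mathbb{E}\|w_1 - w_0\|^2 \\
&= \nabla f(w_0)^T \mathbb{E}\left(-\eta(\nabla f(w_0) + \xi_0)\right) + \frac{\beta}{2} \mathbb{E}\|-\eta(\nabla f(w_0) + \xi_0)\|^2 \\
&= -\left(\eta - \frac{\beta\eta^2}{2}\right)\|\nabla f(w_0)\|^2 + \frac{\eta^2\sigma^2\beta d}{2} \\
&\le -\frac{\eta}{2}\|\nabla f(w_0)\|^2 + \frac{\eta^2\sigma^2\beta d}{2} \le -\frac{\eta^2\sigma^2\beta d}{2}
\end{aligned} \tag{17}$$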


which finishes the proof. □

Lemma 15. Under the assumptions of Theorem 13, for any initial point $w_0$ that is $\tilde{O}(\sqrt{\eta}) < \delta$ close to a local minimum $w^\star$, with probability at least $1 - \zeta/2$, the following holds simultaneously:

$$\forall t \le \tilde{O}\left(\frac{1}{\eta^2}\log\frac{1}{\zeta}\right), \quad \|w_t - w^\star\| \le \tilde{O}\left(\sqrt{\eta\log\frac{1}{\eta\zeta}}\right) < \delta \tag{18}$$

where $w^\star$ is the locally optimal point.


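Proof. We shall construct a supermartingale and use Azuma's inequality (Azuma, 1967) to prove this result.

Let filtration $\mathfrak{F}_t = \sigma\{\xi_0, \cdots, \xi_{t-1}\}$, and note $\sigma\{\Delta_0, \cdots, \Delta_t\} \subset \mathfrak{F}_t$, where $\sigma\{\cdot\}$ denotes the sigma field. Let event $\mathfrak{E}_t = \{\forall \tau \le t, \|w_\tau - w^\star\| \le \mu\sqrt{\eta\log\frac{1}{\eta\zeta}} < \delta\}$, where $\mu$ is independent of $(\eta, \zeta)$, and will be specified later. To ensure the correctness of the proof, the $\tilde{O}$ notation in this proof will never hide any dependence on $\mu$. Clearly there is always a small enough choice of $\eta_{\max} = \tilde{\Theta}(1)$ to make $\mu\sqrt{\eta\log\frac{1}{\eta\zeta}} < \delta$ hold as long as $\eta \le \eta_{\max}/\max\{1, \log(1/\zeta)\}$. Also note $\mathfrak{E}_t \subset \mathfrak{E}_{t-1}$, that is, $\mathbf{1}_{\mathfrak{E}_t} \le \mathbf{1}_{\mathfrak{E}_{t-1}}$.

By Definition 5 of $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, we know $f$ is locally $\alpha$-strongly convex in the $2\delta$-neighborhood of $w^\star$. Since $\nabla f(w^\star) = 0$, we have

$$\nabla f(w_t)^T (w_t - w^\star) \mathbf{1}_{\mathfrak{E}_t} \ge \alpha \|w_t - w^\star\|^2 \mathbf{1}_{\mathfrak{E}_t} \tag{19}$$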


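Furthermore, with $\eta_{\max} < \frac{\alpha}{\beta^2}$, using $\beta$-smoothness, we have:

$$\begin{aligned}
\mathbb{E}[\|w_t - w^\star\|^2 \mathbf{1}_{\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}] &= \mathbb{E}[\|w_{t-1} - \eta(\nabla f(w_{t-1}) + \xi_{t-1}) - w^\star\|^2 \mid \mathfrak{F}_{t-1}] \mathbf{1}_{\mathfrak{E}_{t-1}} \\
&= \left[\|w_{t-1} - w^\star\|^2 - 2\eta\nabla f(w_{t-1})^T (w_{t-1} - w^\star) + \eta^2\|\nabla f(w_{t-1})\|^2 + \eta^2\sigma^2\right] \mathbf{1}_{\mathfrak{E}_{t-1}} \\
&\le \left[(1 - 2\eta\alpha + \eta^2\beta^2)\|w_{t-1} - w^\star\|^2 + \eta^2\sigma^2\right] \mathbf{1}_{\mathfrak{E}_{t-1}} \\
&\le \left[(1 - \eta\alpha)\|w_{t-1} - w^\star\|^2 + \eta^2\sigma^2\right] \mathbf{1}_{\mathfrak{E}_{t-1}}
\end{aligned} \tag{20}$$

Therefore, we have:

$$\left[\mathbb{E}[\|w_t - w^\star\|^2 \mid \mathfrak{F}_{t-1}] - \frac{\eta\sigma^2}{\alpha}\right] \mathbf{1}_{\mathfrak{E}_{t-1}} \le (1 - \eta\alpha)\left[\|w_{t-1} - w^\star\|^2 - \frac{\eta\sigma^2}{\alpha}\right] \mathbf{1}_{\mathfrak{E}_{t-1}} \tag{21}$$

Then, let $G_t = (1 - \eta\alpha)^{-t}\left(\|w_t - w^\star\|^2 - \frac{\eta\sigma^2}{\alpha}\right)$; we have:

$$\mathbb{E}[G_t \mathbf{1}_{\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}] \le G_{t-1} \mathbf{1}_{\mathfrak{E}_{t-1}} \le G_{t-1} \mathbf{1}_{\mathfrak{E}_{t-2}} \tag{22}$$

which means $G_t \mathbf{1}_{\mathfrak{E}_{t-1}}$ is a supermartingale.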
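Therefore, with probability 1, we have:

$$\begin{aligned}
\left|G_t \mathbf{1}_{\mathfrak{E}_{t-1}} - \mathbb{E}[G_t \mathbf{1}_{\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}]\right| &\le (1 - \eta\alpha)^{-t}\left[\|w_{t-1} - \eta\nabla f(w_{t-1}) - w^\star\| \cdot \eta\|\xi_{t-1}\| + \eta^2\|\xi_{t-1}\|^2 - \eta^2\sigma^2\right] \mathbf{1}_{\mathfrak{E}_{t-1}} \\
&\le (1 - \eta\alpha)^{-t} \cdot \tilde{O}\left(\mu\eta^{1.5}\log^{\frac{1}{2}}\frac{1}{\eta\zeta}\right) = d_t
\end{aligned} \tag{23}$$

Let

$$c_t = \sqrt{\sum_{\tau=1}^{t} d_\tau^2} = \tilde{O}\left(\mu\eta^{1.5}\log^{\frac{1}{2}}\frac{1}{\eta\zeta}\right)\sqrt{\sum_{\tau=1}^{t}(1 - \eta\alpha)^{-2\tau}} \tag{24}$$

By Azuma's inequality, with probability less than $\tilde{O}(\eta^3\zeta)$, we have:

$$G_t \mathbf{1}_{\mathfrak{E}_{t-1}} > \tilde{O}(1) c_t \log^{\frac{1}{2}}\left(\frac{1}{\eta\zeta}\right) + G_0 \tag{25}$$

We know $G_t > \tilde{O}(1) c_t \log^{\frac{1}{2}}(\frac{1}{\eta\zeta}) + G_0$ is equivalent to:

$$\|w_t - w^\star\|^2 > \tilde{O}(\eta) + \tilde{O}(1)(1 - \eta\alpha)^t c_t \log^{\frac{1}{2}}\left(\frac{1}{\eta\zeta}\right) \tag{26}$$

We know:

$$\begin{aligned}
(1 - \eta\alpha)^t c_t \log^{\frac{1}{2}}\left(\frac{1}{\eta\zeta}\right) &= \mu \cdot \tilde{O}\left(\eta^{1.5}\log\frac{1}{\eta\zeta}\right)\sqrt{\sum_{\tau=1}^{t}(1 - \eta\alpha)^{2(t-\tau)}} \\
&= \mu \cdot \tilde{O}\left(\eta^{1.5}\log\frac{1}{\eta\zeta}\right)\sqrt{\sum_{\tau=0}^{t-1}(1 - \eta\alpha)^{2\tau}} \le \mu \cdot \tilde{O}\left(\eta^{1.5}\log\frac{1}{\eta\zeta}\right)\sqrt{\frac{1}{1 - (1 - \eta\alpha)^2}} \\
&= \mu \cdot \tilde{O}\left(\eta\log\frac{1}{\eta\zeta}\right)
\end{aligned} \tag{27}$$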

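This means Azuma's inequality implies that there exists some $\tilde{C} = \tilde{O}(1)$ so that:

$$P\left(\mathfrak{E}_{t-1} \cap \left\{\|w_t - w^\star\|^2 > \mu \cdot \tilde{C}\eta\log\frac{1}{\eta\zeta}\right\}\right) \le \tilde{O}(\eta^3\zeta) \tag{28}$$

By choosing $\mu > \tilde{C}$, this is equivalent to:

$$P\left(\mathfrak{E}_{t-1} \cap \left\{\|w_t - w^\star\| > \mu\sqrt{\eta\log\frac{1}{\eta\zeta}}\right\}\right) \le \tilde{O}(\eta^3\zeta) \tag{29}$$

Then we have:

$$P(\overline{\mathfrak{E}}_t) = P\left(\mathfrak{E}_{t-1} \cap \left\{\|w_t - w^\star\| > \mu\sqrt{\eta\log\frac{1}{\eta\zeta}}\right\}\right) + P(\overline{\mathfrak{E}}_{t-1}) \le \tilde{O}(\eta^3\zeta) + P(\overline{\mathfrak{E}}_{t-1}) \tag{30}$$

By the initialization conditions, we know $P(\overline{\mathfrak{E}}_0) = 0$, and thus $P(\overline{\mathfrak{E}}_t) \le t\tilde{O}(\eta^3\zeta)$. Taking $t = \tilde{O}(\frac{1}{\eta^2}\log\frac{1}{\zeta})$, we have $P(\overline{\mathfrak{E}}_t) \le \tilde{O}(\eta\zeta\log\frac{1}{\zeta})$. When $\eta_{\max} = \tilde{O}(1)$ is chosen small enough, and $\eta \le \eta_{\max}/\log(1/\zeta)$, this finishes the proof. □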

Lemma 16. Under the assumptions of Theorem 13, for any initial point $w_0$ where $\|\nabla f(w_0)\| < \sqrt{2\eta\sigma^2\beta d} < \epsilon$, and $\lambda_{\min}(\mathcal{H}(w_0)) \le -\gamma$, there is a number of steps $T$ that depends on $w_0$ such that:

$$\mathbb{E} f(w_T) - f(w_0) \le -\tilde{\Omega}(\eta) \tag{31}$$

The number of steps $T$ has a fixed upper bound $T_{\max}$ that is independent of $w_0$, where $T \le T_{\max} = O((\log d)/\gamma\eta)$.

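Remark. In general, if we relax the assumption $\mathbb{E}\xi\xi^T = \sigma^2 I$ to $\sigma_{\min}^2 I \preceq \mathbb{E}\xi\xi^T \preceq \sigma_{\max}^2 I$, the upper bound $T_{\max}$ on the number of steps required in Lemma 16 would be increased to $T_{\max} = O\left(\frac{1}{\gamma\eta}\left(\log d + \log\frac{\sigma_{\max}}{\sigma_{\min}}\right)\right)$.

As we described in the proof sketch, the main idea is to consider a coupled update sequence that corresponds to the local second-order approximation of $f(x)$ around $w_0$. We characterize this sequence of updates in the next lemma.

Lemma 17. Under the assumptions of Theorem 13, let $\tilde{f}$ be defined as the local second-order approximation of $f(x)$ around $w_0$:

$$\tilde{f}(w) = f(w_0) + \nabla f(w_0)^T (w - w_0) + \frac{1}{2}(w - w_0)^T \mathcal{H}(w_0)(w - w_0) \tag{32}$$

and let $\{\tilde{w}_t\}$ be the corresponding sequence generated by running SGD on function $\tilde{f}$, with $\tilde{w}_0 = w_0$. For simplicity, denote $\mathcal{H} = \mathcal{H}(w_0) = \nabla^2 f(w_0)$; then we have analytically:

$$\nabla\tilde{f}(\tilde{w}_t) = (1 - \eta\mathcal{H})^t \nabla f(w_0) - \eta\mathcal{H}\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^{t-\tau-1}\xi_\tau \tag{33}$$

$$\tilde{w}_t - w_0 = -\eta\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^\tau \nabla f(w_0) - \eta\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^{t-\tau-1}\xi_\tau \tag{34}$$

Furthermore, for any initial point $w_0$ where $\|\nabla f(w_0)\| \le \tilde{O}(\eta) < \epsilon$, and $\lambda_{\min}(\mathcal{H}(w_0)) = -\gamma_0$, there exists a $T \in \mathbb{N}$ satisfying:

$$\frac{d}{\eta\gamma_0} \le \sum_{\tau=0}^{T-1}(1 + \eta\gamma_0)^{2\tau} < \frac{3d}{\eta\gamma_0} \tag{35}$$

and with probability at least $1 - \tilde{O}(\eta^3)$, the following holds simultaneously for all $t \le T$:

$$\|\tilde{w}_t - w_0\| \le \tilde{O}\left(\eta^{\frac{1}{2}}\log\frac{1}{\eta}\right); \quad \|\nabla\tilde{f}(\tilde{w}_t)\| \le \tilde{O}\left(\eta^{\frac{1}{2}}\log\frac{1}{\eta}\right) \tag{36}$$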

rj r] 

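Proof. Denote $\mathcal{H} = \mathcal{H}(w_0)$. Since $\tilde{f}$ is quadratic, clearly we have:

$$\nabla\tilde{f}(\tilde{w}_t) = \nabla\tilde{f}(\tilde{w}_{t-1}) + \mathcal{H}(\tilde{w}_t - \tilde{w}_{t-1}) \tag{37}$$

Substituting the update equation of SGD into Eq.(37), we have:

$$\begin{aligned}
\nabla\tilde{f}(\tilde{w}_t) &= \nabla\tilde{f}(\tilde{w}_{t-1}) - \eta\mathcal{H}(\nabla\tilde{f}(\tilde{w}_{t-1}) + \xi_{t-1}) \\
&= (1 - \eta\mathcal{H})\nabla\tilde{f}(\tilde{w}_{t-1}) - \eta\mathcal{H}\xi_{t-1} \\
&= (1 - \eta\mathcal{H})^2\nabla\tilde{f}(\tilde{w}_{t-2}) - \eta\mathcal{H}\xi_{t-1} - \eta\mathcal{H}(1 - \eta\mathcal{H})\xi_{t-2} = \cdots \\
&= (1 - \eta\mathcal{H})^t\nabla f(w_0) - \eta\mathcal{H}\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^{t-\tau-1}\xi_\tau
\end{aligned} \tag{38}$$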


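Therefore, we have:

$$\begin{aligned}
\tilde{w}_t - w_0 &= -\eta\sum_{\tau=0}^{t-1}\left(\nabla\tilde{f}(\tilde{w}_\tau) + \xi_\tau\right) \\
&= -\eta\sum_{\tau=0}^{t-1}\left((1 - \eta\mathcal{H})^\tau\nabla f(w_0) - \eta\mathcal{H}\sum_{\tau'=0}^{\tau-1}(1 - \eta\mathcal{H})^{\tau-\tau'-1}\xi_{\tau'} + \xi_\tau\right) \\
&= -\eta\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^\tau\nabla f(w_0) - \eta\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^{t-\tau-1}\xi_\tau
\end{aligned} \tag{39}$$

Next, we prove the existence of $T$ in Eq.(35). Since $\sum_{\tau=0}^{t}(1 + \eta\gamma_0)^{2\tau}$ is monotonically increasing w.r.t. $t$, and diverges to infinity as $t \to \infty$, we know there is always some $T \in \mathbb{N}$ that gives $\frac{d}{\eta\gamma_0} \le \sum_{\tau=0}^{T-1}(1 + \eta\gamma_0)^{2\tau}$. Let $T$ be the smallest integer satisfying the above inequality. By assumption, we know $\gamma \le \gamma_0 \le L$, and

$$\sum_{\tau=0}^{t+1}(1 + \eta\gamma_0)^{2\tau} = 1 + (1 + \eta\gamma_0)^2\sum_{\tau=0}^{t}(1 + \eta\gamma_0)^{2\tau} \tag{40}$$

so we can choose $\eta_{\max} < \min\{(\sqrt{2} - 1)/L, 2d/\gamma\}$ so that

$$\frac{d}{\eta\gamma_0} \le \sum_{\tau=0}^{T-1}(1 + \eta\gamma_0)^{2\tau} \le 1 + \frac{2d}{\eta\gamma_0} \le \frac{3d}{\eta\gamma_0} \tag{41}$$

Finally, by Eq.(35), we know $T = O(\log d/\gamma_0\eta)$, and $(1 + \eta\gamma_0)^T \le \tilde{O}(1)$. Also, because $\mathbb{E}\xi = 0$ and $\|\xi\| \le Q = \tilde{O}(1)$ with probability 1, by the Hoeffding inequality we have for each dimension $i$ and time $t \le T$:

$$P\left(\left|\eta\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^{t-\tau-1}\xi_{\tau,i}\right| > \tilde{O}\left(\eta^{\frac{1}{2}}\log\frac{1}{\eta}\right)\right) \le e^{-\tilde{\Omega}(\log^2\frac{1}{\eta})} \le \tilde{O}(\eta^4) \tag{42}$$

Then by summing over the dimension $d$ and taking a union bound over all $t \le T$, we directly have:

$$P\left(\exists t \le T, \left\|\eta\sum_{\tau=0}^{t-1}(1 - \eta\mathcal{H})^{t-\tau-1}\xi_\tau\right\| > \tilde{O}\left(\eta^{\frac{1}{2}}\log\frac{1}{\eta}\right)\right) \le \tilde{O}(\eta^3). \tag{43}$$

Combining this fact with Eq.(38) and Eq.(39), we finish the proof. □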
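The closed forms for the gradient and the displacement of SGD on the quadratic model can be sanity-checked numerically. The sketch below is restricted to the one-dimensional case, where the Hessian is a scalar `h`, and compares the recursion of Eq. (15) applied to the quadratic approximation against the closed-form sums. The function names and the scalar restriction are assumptions made for this illustration.

```python
def sgd_on_quadratic(g0, h, eta, noise):
    """Run SGD on the 1-D quadratic model with initial gradient g0 and Hessian h.

    Returns (w_t - w_0, gradient of the quadratic model at w_t).
    """
    offset, grad = 0.0, g0
    for xi in noise:
        offset -= eta * (grad + xi)   # update rule, Eq. (15)
        grad = g0 + h * offset        # gradient of a quadratic is affine in w
    return offset, grad


def closed_form(g0, h, eta, noise):
    """Closed-form expressions for the same two quantities (scalar case)."""
    t = len(noise)
    a = 1.0 - eta * h
    # sum_{tau=0}^{t-1} (1 - eta h)^{t - tau - 1} xi_tau
    noise_sum = sum(a ** (t - tau - 1) * noise[tau] for tau in range(t))
    grad = a ** t * g0 - eta * h * noise_sum
    offset = -eta * sum(a ** tau for tau in range(t)) * g0 - eta * noise_sum
    return offset, grad
```

With a negative `h` (a direction of negative curvature), the factor $1 - \eta h$ exceeds 1, so early noise terms are amplified the most; this is the mechanism the lemma uses to move away from the saddle point.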


Next we need to prove that the two sequences of updates are always close.

Lemma 18. Under the assumptions of Theorem 13, let $\{w_t\}$ be the corresponding sequence generated by running SGD on function $f$. Also let $\tilde{f}$ and $\{\tilde{w}_t\}$ be defined as in Lemma 17. Then, for any initial point $w_0$ where $\|\nabla f(w_0)\| \le \tilde{O}(\eta) < \epsilon$, and $\lambda_{\min}(\nabla^2 f(w_0)) = -\gamma_0$, given the choice of $T$ as in Eq.(35), with probability at least $1 - \tilde{O}(\eta^2)$, the following holds simultaneously for all $t \le T$:

$$\|w_t - \tilde{w}_t\| \le \tilde{O}\left(\eta\log^2\frac{1}{\eta}\right); \quad \|\nabla f(w_t) - \nabla\tilde{f}(\tilde{w}_t)\| \le \tilde{O}\left(\eta\log^2\frac{1}{\eta}\right) \tag{44}$$

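Proof. First, we have the update function of the gradient:

$$\begin{aligned}
\nabla f(w_t) &= \nabla f(w_{t-1}) + \int_0^1 \mathcal{H}(w_{t-1} + t(w_t - w_{t-1}))\,dt \cdot (w_t - w_{t-1}) \\
&= \nabla f(w_{t-1}) + \mathcal{H}(w_{t-1})(w_t - w_{t-1}) + \theta_{t-1}
\end{aligned} \tag{45}$$

where the remainder is:

$$\theta_{t-1} \equiv \int_0^1 \left[\mathcal{H}(w_{t-1} + t(w_t - w_{t-1})) - \mathcal{H}(w_{t-1})\right]dt \cdot (w_t - w_{t-1}) \tag{46}$$

Denote $\mathcal{H} = \mathcal{H}(w_0)$, and $\mathcal{H}'_{t-1} = \mathcal{H}(w_{t-1}) - \mathcal{H}(w_0)$. By Hessian smoothness, we immediately have:

$$\|\mathcal{H}'_{t-1}\| = \|\mathcal{H}(w_{t-1}) - \mathcal{H}(w_0)\| \le \rho\|w_{t-1} - w_0\| \le \rho(\|w_{t-1} - \tilde{w}_{t-1}\| + \|\tilde{w}_{t-1} - w_0\|) \tag{47}$$

$$\|\theta_{t-1}\| \le \frac{\rho}{2}\|w_t - w_{t-1}\|^2 \tag{48}$$

Substituting the update equation of SGD (Eq.(15)) into Eq.(45), we have:

$$\begin{aligned}
\nabla f(w_t) &= \nabla f(w_{t-1}) - \eta(\mathcal{H} + \mathcal{H}'_{t-1})(\nabla f(w_{t-1}) + \xi_{t-1}) + \theta_{t-1} \\
&= (1 - \eta\mathcal{H})\nabla f(w_{t-1}) - \eta\mathcal{H}\xi_{t-1} - \eta\mathcal{H}'_{t-1}(\nabla f(w_{t-1}) + \xi_{t-1}) + \theta_{t-1}
\end{aligned} \tag{49}$$

Let $\Delta_t = \nabla f(w_t) - \nabla\tilde{f}(\tilde{w}_t)$ denote the difference in gradient; then from Eq.(38), Eq.(49), and Eq.(15), we have:

$$\Delta_t = (1 - \eta\mathcal{H})\Delta_{t-1} - \eta\mathcal{H}'_{t-1}[\Delta_{t-1} + \nabla\tilde{f}(\tilde{w}_{t-1}) + \xi_{t-1}] + \theta_{t-1} \tag{50}$$

$$w_t - \tilde{w}_t = -\eta\sum_{\tau=0}^{t-1}\Delta_\tau \tag{51}$$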

t=0 


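Let filtration $\mathfrak{F}_t = \sigma\{\xi_0, \cdots, \xi_{t-1}\}$, and note $\sigma\{\Delta_0, \cdots, \Delta_t\} \subset \mathfrak{F}_t$, where $\sigma\{\cdot\}$ denotes the sigma field. Also, let event $\mathfrak{K}_t = \{\forall \tau \le t, \|\nabla\tilde{f}(\tilde{w}_\tau)\| \le \tilde{O}(\eta^{\frac{1}{2}}\log\frac{1}{\eta}), \|\tilde{w}_\tau - w_0\| \le \tilde{O}(\eta^{\frac{1}{2}}\log\frac{1}{\eta})\}$, and $\mathfrak{E}_t = \{\forall \tau \le t, \|\Delta_\tau\| \le \mu\eta\log^2\frac{1}{\eta}\}$, where $\mu$ is independent of $(\eta, \zeta)$, and will be specified later. Again, the $\tilde{O}$ notation in this proof will never hide any dependence on $\mu$. Clearly, we have $\mathfrak{K}_t \subset \mathfrak{K}_{t-1}$ ($\mathfrak{E}_t \subset \mathfrak{E}_{t-1}$), thus $\mathbf{1}_{\mathfrak{K}_t} \le \mathbf{1}_{\mathfrak{K}_{t-1}}$ ($\mathbf{1}_{\mathfrak{E}_t} \le \mathbf{1}_{\mathfrak{E}_{t-1}}$), where $\mathbf{1}_{\mathfrak{K}}$ is the indicator function of event $\mathfrak{K}$.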

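We first need to carefully bound all terms in Eq.(50). Conditioned on event $\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}$, by Eq.(47), Eq.(48), and Eq.(51), with probability 1, for all $t \le T \le O(\log d/\gamma_0\eta)$, we have:

$$\begin{aligned}
\|(1 - \eta\mathcal{H})\Delta_{t-1}\| &\le \tilde{O}\left(\mu\eta\log^2\frac{1}{\eta}\right) & \|\eta\mathcal{H}'_{t-1}(\Delta_{t-1} + \nabla\tilde{f}(\tilde{w}_{t-1}))\| &\le \tilde{O}\left(\eta^2\log^2\frac{1}{\eta}\right) \\
\|\eta\mathcal{H}'_{t-1}\xi_{t-1}\| &\le \tilde{O}\left(\eta^{1.5}\log\frac{1}{\eta}\right) & \|\theta_{t-1}\| &\le \tilde{O}(\eta^2)
\end{aligned} \tag{52}$$

Since event $\mathfrak{K}_{t-1} \subset \mathfrak{F}_{t-1}$ and $\mathfrak{E}_{t-1} \subset \mathfrak{F}_{t-1}$, and thus both are independent of $\xi_{t-1}$, we also have:

$$\begin{aligned}
&\mathbb{E}\left[((1 - \eta\mathcal{H})\Delta_{t-1})^T \eta\mathcal{H}'_{t-1}\xi_{t-1}\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}\right] \\
&= \mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}}((1 - \eta\mathcal{H})\Delta_{t-1})^T \eta\mathcal{H}'_{t-1}\mathbb{E}[\xi_{t-1} \mid \mathfrak{F}_{t-1}] = 0
\end{aligned} \tag{53}$$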

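Therefore, from Eq.(50), together with the bounds in Eq.(52) and the vanishing cross term in Eq.(53), we have:

$$\begin{aligned}
\mathbb{E}\left[\|\Delta_t\|^2\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}\right] &\le \left[(1 + \eta\gamma_0)^2\|\Delta_{t-1}\|^2 + (1 + \eta\gamma_0)\|\Delta_{t-1}\|\tilde{O}\left(\eta^2\log^2\frac{1}{\eta}\right) + \tilde{O}\left(\eta^3\log^2\frac{1}{\eta}\right)\right]\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \\
&\le \left[(1 + \eta\gamma_0)^2\|\Delta_{t-1}\|^2 + \tilde{O}\left(\mu\eta^3\log^4\frac{1}{\eta}\right)\right]\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}}
\end{aligned} \tag{54}$$

Define

$$G_t = (1 + \eta\gamma_0)^{-2t}\left[\|\Delta_t\|^2 + \alpha\eta^2\log^4\frac{1}{\eta}\right] \tag{55}$$

Then, when $\eta_{\max}$ is small enough, we have:

$$\begin{aligned}
\mathbb{E}\left[G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}\right] &= (1 + \eta\gamma_0)^{-2t}\left[\mathbb{E}\left[\|\Delta_t\|^2\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}\right] + \alpha\eta^2\log^4\frac{1}{\eta}\,\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}}\right] \\
&\le (1 + \eta\gamma_0)^{-2t}\left[(1 + \eta\gamma_0)^2\|\Delta_{t-1}\|^2 + \tilde{O}\left(\mu\eta^3\log^4\frac{1}{\eta}\right) + \alpha\eta^2\log^4\frac{1}{\eta}\right]\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \\
&\le (1 + \eta\gamma_0)^{-2t}\left[(1 + \eta\gamma_0)^2\|\Delta_{t-1}\|^2 + (1 + \eta\gamma_0)^2\alpha\eta^2\log^4\frac{1}{\eta}\right]\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \\
&= G_{t-1}\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \le G_{t-1}\mathbf{1}_{\mathfrak{K}_{t-2}\cap\mathfrak{E}_{t-2}}
\end{aligned} \tag{56}$$

Therefore, we have $\mathbb{E}[G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}] \le G_{t-1}\mathbf{1}_{\mathfrak{K}_{t-2}\cap\mathfrak{E}_{t-2}}$, which means $G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}}$ is a supermartingale.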

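On the other hand, we have:

$$\Delta_t = (1 - \eta\mathcal{H})\Delta_{t-1} - \eta\mathcal{H}'_{t-1}(\Delta_{t-1} + \nabla\tilde{f}(\tilde{w}_{t-1})) - \eta\mathcal{H}'_{t-1}\xi_{t-1} + \theta_{t-1} \tag{57}$$

Once we condition on the filtration $\mathfrak{F}_{t-1}$, the first two terms are deterministic, and only the third and fourth terms are random. Therefore, we know, with probability 1:

$$\left|\|\Delta_t\|^2 - \mathbb{E}[\|\Delta_t\|^2 \mid \mathfrak{F}_{t-1}]\right|\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \le \tilde{O}\left(\mu\eta^{2.5}\log^3\frac{1}{\eta}\right) \tag{58}$$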

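where the main contribution comes from the product of the first term and the third term. Then, with probability 1, we have:

$$\begin{aligned}
&\left|G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} - \mathbb{E}[G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}]\right| \\
&= (1 + \eta\gamma_0)^{-2t} \cdot \left|\|\Delta_t\|^2 - \mathbb{E}[\|\Delta_t\|^2 \mid \mathfrak{F}_{t-1}]\right| \cdot \mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \le \tilde{O}\left(\mu\eta^{2.5}\log^3\frac{1}{\eta}\right) = c_{t-1}
\end{aligned} \tag{59}$$

By the Azuma-Hoeffding inequality, with probability less than $\tilde{O}(\eta^3)$, for $t \le T \le O(\log d/\gamma_0\eta)$:

$$G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} - G_0 \cdot 1 > \tilde{O}(1)\sqrt{\sum_{\tau=0}^{t-1}c_\tau^2}\,\log\left(\frac{1}{\eta}\right) = \tilde{O}\left(\mu\eta^2\log^4\frac{1}{\eta}\right) \tag{60}$$

This means there exists some $\tilde{C} = \tilde{O}(1)$ so that:

$$P\left(G_t\mathbf{1}_{\mathfrak{K}_{t-1}\cap\mathfrak{E}_{t-1}} \ge \tilde{C}\mu\eta^2\log^4\frac{1}{\eta}\right) \le \tilde{O}(\eta^3) \tag{61}$$

By choosing $\mu > \tilde{C}$, this is equivalent to:

$$P\left(\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1} \cap \left\{\|\Delta_t\|^2 \ge \mu^2\eta^2\log^4\frac{1}{\eta}\right\}\right) \le \tilde{O}(\eta^3) \tag{62}$$

Therefore, combined with Lemma 17, we have:

$$\begin{aligned}
&P\left(\mathfrak{E}_{t-1} \cap \left\{\|\Delta_t\| \ge \mu\eta\log^2\frac{1}{\eta}\right\}\right) \\
&= P\left(\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1} \cap \left\{\|\Delta_t\| \ge \mu\eta\log^2\frac{1}{\eta}\right\}\right) + P\left(\overline{\mathfrak{K}}_{t-1} \cap \mathfrak{E}_{t-1} \cap \left\{\|\Delta_t\| \ge \mu\eta\log^2\frac{1}{\eta}\right\}\right) \\
&\le \tilde{O}(\eta^3) + P(\overline{\mathfrak{K}}_{t-1}) \le \tilde{O}(\eta^3)
\end{aligned} \tag{63}$$

Finally, we know:

$$P(\overline{\mathfrak{E}}_t) = P\left(\mathfrak{E}_{t-1} \cap \left\{\|\Delta_t\| \ge \mu\eta\log^2\frac{1}{\eta}\right\}\right) + P(\overline{\mathfrak{E}}_{t-1}) \le \tilde{O}(\eta^3) + P(\overline{\mathfrak{E}}_{t-1}) \tag{64}$$

Because $P(\overline{\mathfrak{E}}_0) = 0$, and $T \le \tilde{O}(\frac{1}{\eta})$, we have $P(\overline{\mathfrak{E}}_T) \le \tilde{O}(\eta^2)$. Due to Eq.(51), we have $\|w_t - \tilde{w}_t\| \le \eta\sum_{\tau=0}^{t-1}\|\Delta_\tau\|$; then by the definition of $\mathfrak{E}_T$, we finish the proof. □

Using the two lemmas above we are ready to prove Lemma 16.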

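Proof of Lemma 16. Let $\tilde{f}$ and $\{\tilde{w}_t\}$ be defined as in Lemma 17, and also let $\lambda_{\min}(\mathcal{H}(w_0)) = -\gamma_0$. Since $\mathcal{H}(w)$ is $\rho$-Lipschitz, for any $w, w_0$, we have:

$$f(w) \le f(w_0) + \nabla f(w_0)^T(w - w_0) + \frac{1}{2}(w - w_0)^T\mathcal{H}(w_0)(w - w_0) + \frac{\rho}{6}\|w - w_0\|^3 \tag{65}$$

Denote $\tilde{\delta} = \tilde{w}_T - w_0$ and $\delta = w_T - \tilde{w}_T$; we have:

$$\begin{aligned}
f(w_T) - f(w_0) &\le \left[\nabla f(w_0)^T(w_T - w_0) + \frac{1}{2}(w_T - w_0)^T\mathcal{H}(w_0)(w_T - w_0) + \frac{\rho}{6}\|w_T - w_0\|^3\right] \\
&= \left[\nabla f(w_0)^T(\tilde{\delta} + \delta) + \frac{1}{2}(\tilde{\delta} + \delta)^T\mathcal{H}(\tilde{\delta} + \delta) + \frac{\rho}{6}\|\tilde{\delta} + \delta\|^3\right] \\
&= \left[\nabla f(w_0)^T\tilde{\delta} + \frac{1}{2}\tilde{\delta}^T\mathcal{H}\tilde{\delta}\right] + \left[\nabla f(w_0)^T\delta + \tilde{\delta}^T\mathcal{H}\delta + \frac{1}{2}\delta^T\mathcal{H}\delta + \frac{\rho}{6}\|\tilde{\delta} + \delta\|^3\right]
\end{aligned} \tag{66}$$

where $\mathcal{H} = \mathcal{H}(w_0)$. Denote by $\tilde{\Lambda} = \nabla f(w_0)^T\tilde{\delta} + \frac{1}{2}\tilde{\delta}^T\mathcal{H}\tilde{\delta}$ the first term, and by $\Lambda = \nabla f(w_0)^T\delta + \tilde{\delta}^T\mathcal{H}\delta + \frac{1}{2}\delta^T\mathcal{H}\delta + \frac{\rho}{6}\|\tilde{\delta} + \delta\|^3$ the second term. We have $f(w_T) - f(w_0) \le \tilde{\Lambda} + \Lambda$.

Let $\mathfrak{E}_t = \{\forall \tau \le t, \|\tilde{w}_\tau - w_0\| \le \tilde{O}(\eta^{\frac{1}{2}}\log\frac{1}{\eta}), \|w_\tau - \tilde{w}_\tau\| \le \tilde{O}(\eta\log^2\frac{1}{\eta})\}$. By the results of Lemma 17 and Lemma 18, we know $P(\mathfrak{E}_T) \ge 1 - \tilde{O}(\eta^2)$. Then, clearly, we have:

$$\begin{aligned}
\mathbb{E} f(w_T) - f(w_0) &= \mathbb{E}[f(w_T) - f(w_0)]\mathbf{1}_{\mathfrak{E}_T} + \mathbb{E}[f(w_T) - f(w_0)]\mathbf{1}_{\overline{\mathfrak{E}}_T} \\
&\le \mathbb{E}\tilde{\Lambda}\mathbf{1}_{\mathfrak{E}_T} + \mathbb{E}\Lambda\mathbf{1}_{\mathfrak{E}_T} + \mathbb{E}[f(w_T) - f(w_0)]\mathbf{1}_{\overline{\mathfrak{E}}_T} \\
&= \mathbb{E}\tilde{\Lambda} + \mathbb{E}\Lambda\mathbf{1}_{\mathfrak{E}_T} + \mathbb{E}[f(w_T) - f(w_0)]\mathbf{1}_{\overline{\mathfrak{E}}_T} - \mathbb{E}\tilde{\Lambda}\mathbf{1}_{\overline{\mathfrak{E}}_T}
\end{aligned} \tag{67}$$

We will carefully calculate the $\mathbb{E}\tilde{\Lambda}$ term first, and then bound the remaining terms as a “perturbation” to the first term.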

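Let $\lambda_1, \cdots, \lambda_d$ be the eigenvalues of $\mathcal{H}$. By the result of Lemma 17 and simple linear algebra, we have:

$$\begin{aligned}
\mathbb{E}\tilde{\Lambda} &= -\frac{\eta}{2}\sum_{i=1}^{d}\sum_{\tau=0}^{2T-1}(1 - \eta\lambda_i)^\tau|\nabla_i f(w_0)|^2 + \frac{1}{2}\sum_{i=1}^{d}\lambda_i\sum_{\tau=0}^{T-1}(1 - \eta\lambda_i)^{2\tau}\eta^2\sigma^2 \\
&\le \frac{1}{2}\sum_{i=1}^{d}\lambda_i\sum_{\tau=0}^{T-1}(1 - \eta\lambda_i)^{2\tau}\eta^2\sigma^2 \\
&\le \frac{\eta^2\sigma^2}{2}\left[\frac{d - 1}{\eta} - \gamma_0\sum_{\tau=0}^{T-1}(1 + \eta\gamma_0)^{2\tau}\right] \le -\frac{\eta\sigma^2}{2}
\end{aligned} \tag{68}$$

The last inequality is directly implied by the choice of $T$ as in Eq.(35). Also, by Eq.(35), we immediately have that $T = O(\log d/\gamma_0\eta) \le O(\log d/\gamma\eta)$. Therefore, by choosing $T_{\max} = O(\log d/\gamma\eta)$ with a large enough constant, we have $T \le T_{\max} = O(\log d/\gamma\eta)$.

For bounding the second term, by the definition of $\mathfrak{E}_T$, we have:

$$\mathbb{E}\Lambda\mathbf{1}_{\mathfrak{E}_T} = \mathbb{E}\left[\nabla f(w_0)^T\delta + \tilde{\delta}^T\mathcal{H}\delta + \frac{1}{2}\delta^T\mathcal{H}\delta + \frac{\rho}{6}\|\tilde{\delta} + \delta\|^3\right]\mathbf{1}_{\mathfrak{E}_T} \le \tilde{O}\left(\eta^{1.5}\log^3\frac{1}{\eta}\right) \tag{69}$$

On the other hand, since the noise is bounded as $\|\xi\| \le \tilde{O}(1)$, from the results of Lemma 17 it is easy to show that $\|\tilde{w}_T - w_0\| = \|\tilde{\delta}\| \le \tilde{O}(1)$ is also bounded with probability 1. Recall the assumption that the function $f$ is also bounded; then we have:

$$\begin{aligned}
&\mathbb{E}[f(w_T) - f(w_0)]\mathbf{1}_{\overline{\mathfrak{E}}_T} - \mathbb{E}\tilde{\Lambda}\mathbf{1}_{\overline{\mathfrak{E}}_T} \\
&= \mathbb{E}[f(w_T) - f(w_0)]\mathbf{1}_{\overline{\mathfrak{E}}_T} - \mathbb{E}\left[\nabla f(w_0)^T\tilde{\delta} + \frac{1}{2}\tilde{\delta}^T\mathcal{H}\tilde{\delta}\right]\mathbf{1}_{\overline{\mathfrak{E}}_T} \le \tilde{O}(1)P(\overline{\mathfrak{E}}_T) \le \tilde{O}(\eta^2)
\end{aligned} \tag{70}$$

Finally, substituting Eq.(68), Eq.(69) and Eq.(70) into Eq.(67), we finish the proof. □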

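Finally, we combine the three cases to prove the main theorem.

Proof of Theorem 13. Let us set $\mathcal{L}_1 = \{w \mid \|\nabla f(w)\| \ge \sqrt{2\eta\sigma^2\beta d}\}$, $\mathcal{L}_2 = \{w \mid \|\nabla f(w)\| \le \sqrt{2\eta\sigma^2\beta d}$ and $\lambda_{\min}(\mathcal{H}(w)) \le -\gamma\}$, and $\mathcal{L}_3 = \mathcal{L}_1^c \cap \mathcal{L}_2^c$. By choosing a small enough $\eta_{\max}$, we can make $\sqrt{2\eta\sigma^2\beta d} < \min\{\epsilon, \alpha\delta\}$. Under this choice, we know from Definition 5 of $(\alpha, \gamma, \epsilon, \delta)$-strict saddle that $\mathcal{L}_3$ is the locally $\alpha$-strongly convex region which is $\tilde{O}(\sqrt{\eta})$-close to some local minimum.

We shall first prove that within $\tilde{O}(\frac{1}{\eta^2}\log\frac{1}{\zeta})$ steps, with probability at least $1 - \zeta/2$, one of the $w_t$ is in $\mathcal{L}_3$. Then by Lemma 15 we know that with probability at most $\zeta/2$ there exists a $w_t$ that is in $\mathcal{L}_3$ but the last point is not. By a union bound we will get the main result.

To prove that within $\tilde{O}(\frac{1}{\eta^2}\log\frac{1}{\zeta})$ steps, with probability at least $1 - \zeta/2$, one of the $w_t$ is in $\mathcal{L}_3$, we first show that starting from any point, in $\tilde{O}(\frac{1}{\eta^2})$ steps, with probability at least $1/2$, one of the $w_t$ is in $\mathcal{L}_3$. Then we can repeat this $\log 1/\zeta$ times to get the high probability result.

Define the stochastic process $\{\tau_i\}$ s.t. $\tau_0 = 0$, and

$$\tau_{i+1} = \begin{cases} \tau_i + 1 & \text{if } w_{\tau_i} \in \mathcal{L}_1 \cup \mathcal{L}_3 \\ \tau_i + T(w_{\tau_i}) & \text{if } w_{\tau_i} \in \mathcal{L}_2 \end{cases} \tag{71}$$

where $T(w_{\tau_i})$ is defined by Eq.(35) with $\gamma_0 = -\lambda_{\min}(\mathcal{H}(w_{\tau_i}))$, and we know $T \le T_{\max} = \tilde{O}(\frac{1}{\eta})$.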
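By Lemma 14 and Lemma 16, we know:

$$\mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid w_{\tau_i} \in \mathcal{L}_1, \mathfrak{F}_{\tau_i - 1}] = \mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid w_{\tau_i} \in \mathcal{L}_1] \le -\tilde{\Omega}(\eta^2) \tag{72}$$

$$\mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid w_{\tau_i} \in \mathcal{L}_2, \mathfrak{F}_{\tau_i - 1}] = \mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid w_{\tau_i} \in \mathcal{L}_2] \le -\tilde{\Omega}(\eta) \tag{73}$$

Therefore, combining the above equations, we have:

$$\mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid w_{\tau_i} \notin \mathcal{L}_3, \mathfrak{F}_{\tau_i - 1}] = \mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid w_{\tau_i} \notin \mathcal{L}_3] \le -(\tau_{i+1} - \tau_i)\tilde{\Omega}(\eta^2) \tag{74}$$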

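Define event $\mathfrak{E}_i = \{\exists j \le i, w_{\tau_j} \in \mathcal{L}_3\}$; clearly $\mathfrak{E}_i \subset \mathfrak{E}_{i+1}$, thus $P(\mathfrak{E}_i) \le P(\mathfrak{E}_{i+1})$. Finally, consider $f(w_{\tau_{i+1}})\mathbf{1}_{\overline{\mathfrak{E}}_i}$. We have:

$$\begin{aligned}
\mathbb{E} f(w_{\tau_{i+1}})\mathbf{1}_{\overline{\mathfrak{E}}_i} - \mathbb{E} f(w_{\tau_i})\mathbf{1}_{\overline{\mathfrak{E}}_{i-1}} &\le B \cdot P(\mathfrak{E}_i - \mathfrak{E}_{i-1}) + \mathbb{E}[f(w_{\tau_{i+1}}) - f(w_{\tau_i}) \mid \overline{\mathfrak{E}}_i] \cdot P(\overline{\mathfrak{E}}_i) \\
&\le B \cdot P(\mathfrak{E}_i - \mathfrak{E}_{i-1}) - (\tau_{i+1} - \tau_i)\tilde{\Omega}(\eta^2)P(\overline{\mathfrak{E}}_i)
\end{aligned} \tag{75}$$

Therefore, by summing up over $i$, we have:

$$\mathbb{E} f(w_{\tau_i})\mathbf{1}_{\overline{\mathfrak{E}}_i} - f(w_0) \le B P(\mathfrak{E}_i) - \tau_i\tilde{\Omega}(\eta^2)P(\overline{\mathfrak{E}}_i) \le B - \tau_i\tilde{\Omega}(\eta^2)P(\overline{\mathfrak{E}}_i) \tag{76}$$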

Since $|f(w_{\tau_i})\mathbf{1}_{\overline{\mathfrak{E}}_i}| \le B$ is bounded, as $\tau_i$ grows to as large as $\tilde{O}(\frac{1}{\eta^2})$ we must have $P(\overline{\mathfrak{E}}_i) < \frac{1}{2}$. That is, after $\tilde{O}(\frac{1}{\eta^2})$ steps, with probability at least $1/2$, $\{w_t\}$ has entered $\mathcal{L}_3$ at least once. Since this argument holds for any starting point, we can repeat this $\log 1/\zeta$ times and we know that after $\tilde{O}(\frac{1}{\eta^2}\log 1/\zeta)$ steps, with probability at least $1 - \zeta/2$, $\{w_t\}$ has entered $\mathcal{L}_3$ at least once.

Combining with Lemma 15, and by a union bound, we know that after $\tilde{O}(\frac{1}{\eta^2}\log 1/\zeta)$ steps, with probability at least $1 - \zeta$, $w_t$ will be in the $\tilde{O}(\sqrt{\eta\log\frac{1}{\eta\zeta}})$ neighborhood of some local minimum. □

B Detailed Analysis for Section 3 in Constrained Case


So far, we have discussed only the unconstrained problem. In this section we extend our result to equality constrained problems under some mild conditions.

Consider the equality constrained optimization problem:

$$\begin{aligned}
\min_w \quad & f(w) \\
\text{s.t.} \quad & c_i(w) = 0, \quad i = 1, \cdots, m
\end{aligned} \tag{77}$$

Define the feasible set as the set of points that satisfy all the constraints: $\mathcal{W} = \{w \mid c_i(w) = 0; i = 1, \cdots, m\}$.

In this case, the algorithm we are running is Projected Noisy Gradient Descent. Let the function $\Pi_{\mathcal{W}}(v)$ be the projection onto the feasible set, where the projection is defined as the global solution of $\min_{w \in \mathcal{W}}\|v - w\|^2$.

With the same argument as in the unconstrained case, we can slightly simplify and convert it to standard projected stochastic gradient descent (PSGD) with update equations:

$$v_t = w_{t-1} - \eta(\nabla f(w_{t-1}) + \xi_{t-1}) \tag{78}$$

$$w_t = \Pi_{\mathcal{W}}(v_t) \tag{79}$$

As in the unconstrained case, we are interested in noise $\xi$ that is i.i.d. and satisfies $\mathbb{E}\xi = 0$, $\mathbb{E}\xi\xi^T = \sigma^2 I$ and $\|\xi\| \le Q$ almost surely. Our proof can be easily extended to Algorithm 2 with $\frac{1}{d} I \preceq \mathbb{E}\xi\xi^T \preceq (Q + \frac{1}{d}) I$. In this section we first introduce basic tools for handling constrained optimization problems (most of these materials can be found in Wright and Nocedal (1999)), then we prove some technical lemmas that are useful for dealing with the projection step in PSGD, and finally we point out how to modify the previous analysis.
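For the unit-sphere constraint $\|w\|^2 = 1$ that appears in the tensor decomposition objective, the projection has a simple closed form: for any $v \ne 0$, the closest point on the sphere is $v/\|v\|$. The following sketch of one PSGD step uses that special case; the function names are hypothetical, and the noise is scaled by $\eta$ together with the gradient, as in Eq. (15).

```python
import math


def project_to_sphere(v):
    """Closest point to v on the unit sphere (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def psgd_step(w, grad, xi, eta):
    """One projected step: gradient-plus-noise move, then projection onto the sphere."""
    v = [wi - eta * (gi + xii) for wi, gi, xii in zip(w, grad, xi)]
    return project_to_sphere(v)
```

For a general feasible set the projection is itself an optimization problem; the sphere is used here only because its projection is explicit.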


B.l Preliminaries 

Often for constrained optimization problems we want the constraints to satisfy some regularity conditions. LICQ (linear independence constraint qualification) is a common assumption in this context.

Definition 19 (LICQ). In the equality-constrained problem Eq.(77), given a point $w$, we say that the linear independence constraint qualification (LICQ) holds if the set of constraint gradients $\{\nabla c_i(w), i = 1, \cdots, m\}$ is linearly independent.

In constrained optimization, we can locally transform the problem into an unconstrained one by introducing Lagrange multipliers. The Lagrangian $\mathcal{L}$ can be written as

$$\mathcal{L}(w, \lambda) = f(w) - \sum_{i=1}^{m}\lambda_i c_i(w) \tag{80}$$

Then, if LICQ holds for all $w \in \mathcal{W}$, we can properly define the function $\lambda^*(\cdot)$ to be:

$$\lambda^*(w) = \arg\min_\lambda\left\|\nabla f(w) - \sum_{i=1}^{m}\lambda_i\nabla c_i(w)\right\| = \arg\min_\lambda\|\nabla_w\mathcal{L}(w, \lambda)\| \tag{81}$$

where $\lambda^*(\cdot)$ can be calculated analytically: let matrix $C(w) = (\nabla c_1(w), \cdots, \nabla c_m(w))$; then we have:

$$\lambda^*(w) = C(w)^\dagger\nabla f(w) = (C(w)^T C(w))^{-1}C(w)^T\nabla f(w) \tag{82}$$

where $(\cdot)^\dagger$ is the Moore-Penrose pseudo-inverse.

In our setting we need a stronger regularity condition which we call robust LICQ (RLICQ).

Definition 20 ($\alpha_c$-RLICQ). In the equality-constrained problem Eq.(77), given a point $w$, we say that the $\alpha_c$-robust linear independence constraint qualification ($\alpha_c$-RLICQ) holds if the minimum singular value of the matrix $C(w) = (\nabla c_1(w), \cdots, \nabla c_m(w))$ is greater than or equal to $\alpha_c$, that is, $\sigma_{\min}(C(w)) \ge \alpha_c$.

Remark. Given a point $w \in \mathcal{W}$, $\alpha_c$-RLICQ implies LICQ. While LICQ holding for all $w \in \mathcal{W}$ is a necessary condition for $\lambda^*(w)$ to be well-defined, it is easy to check that $\alpha_c$-RLICQ holding for all $w \in \mathcal{W}$ is a necessary condition for $\lambda^*(w)$ to be bounded. Later, we will also see that $\alpha_c$-RLICQ combined with the smoothness of $\{c_i(w)\}_{i=1}^m$ guarantees that the curvature of the constraint manifold is bounded everywhere.

Note that we require this condition in order to provide a quantitative bound; without this assumption there can be cases that are exponentially close to a function that does not satisfy LICQ.

We can also write down the first-order and second-order partial derivatives of the Lagrangian $\mathcal{L}$ at the point $(w, \lambda^*(w))$:

$$\chi(w) = \nabla_w\mathcal{L}(w, \lambda)\big|_{(w, \lambda^*(w))} = \nabla f(w) - \sum_{i=1}^{m}\lambda_i^*(w)\nabla c_i(w) \tag{83}$$

$$\mathfrak{M}(w) = \nabla_{ww}^2\mathcal{L}(w, \lambda)\big|_{(w, \lambda^*(w))} = \nabla^2 f(w) - \sum_{i=1}^{m}\lambda_i^*(w)\nabla^2 c_i(w) \tag{84}$$

Definition 21 (Tangent Space and Normal Space). Given a feasible point $w \in \mathcal{W}$, define its corresponding Tangent Space to be $\mathcal{T}(w) = \{v \mid \nabla c_i(w)^T v = 0; i = 1, \cdots, m\}$, and Normal Space to be $\mathcal{T}^c(w) = \mathrm{span}\{\nabla c_1(w), \cdots, \nabla c_m(w)\}$.

If $w \in \mathbb{R}^d$, and we have $m$ constraints satisfying $\alpha_c$-RLICQ, the tangent space is a linear subspace with dimension $d - m$, and the normal space is a linear subspace with dimension $m$. We also know immediately that $\chi(w)$ defined in Eq.(83) has another interpretation: it is the component of the gradient $\nabla f(w)$ in the tangent space.

Also, it is easy to see that the normal space $\mathcal{T}^c(w)$ is the orthogonal complement of $\mathcal{T}(w)$. We can also define the projection matrix of any vector onto the tangent space (or normal space) to be $P_{\mathcal{T}(w)}$ (or $P_{\mathcal{T}^c(w)}$). Then, clearly, both $P_{\mathcal{T}(w)}$ and $P_{\mathcal{T}^c(w)}$ are orthoprojectors, thus symmetric. Also by the Pythagorean theorem, we have:

$$\|v\|^2 = \|P_{\mathcal{T}(w)}v\|^2 + \|P_{\mathcal{T}^c(w)}v\|^2, \quad \forall v \in \mathbb{R}^d \tag{85}$$
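For a single constraint ($m = 1$) the pseudo-inverse in Eq. (82) reduces to a ratio of inner products, which makes the interpretation of $\chi(w)$ as the tangential component of the gradient easy to check. The sketch below assumes the unit-sphere constraint $c(w) = \|w\|^2 - 1$, so that $\nabla c(w) = 2w$; the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def lagrange_multiplier(w, grad_f):
    """lambda*(w) from Eq. (82) with m = 1 and c(w) = ||w||^2 - 1."""
    c_grad = [2.0 * x for x in w]
    return dot(c_grad, grad_f) / dot(c_grad, c_grad)


def tangent_gradient(w, grad_f):
    """chi(w) from Eq. (83): the gradient minus its component along grad c(w)."""
    lam = lagrange_multiplier(w, grad_f)
    return [g - lam * 2.0 * x for g, x in zip(grad_f, w)]
```

For example, at $w = (0.6, 0.8)$ with gradient $(1, 2)$ the multiplier is $1.1$ and the resulting $\chi(w) = (-0.32, 0.24)$ is orthogonal to $w$, i.e. it lies in the tangent space of the sphere at $w$.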

Taylor Expansion Let $w, w_0 \in \mathcal{W}$, and fix $\lambda^* = \lambda^*(w_0)$ independent of $w$. Assume $\nabla_{ww}^2\mathcal{L}(w, \lambda^*)$ is $\rho_L$-Lipschitz, that is, $\|\nabla_{ww}^2\mathcal{L}(w_1, \lambda^*) - \nabla_{ww}^2\mathcal{L}(w_2, \lambda^*)\| \le \rho_L\|w_1 - w_2\|$. By Taylor expansion, we have:

$$\begin{aligned}
\mathcal{L}(w, \lambda^*) \le{} & \mathcal{L}(w_0, \lambda^*) + \nabla_w\mathcal{L}(w_0, \lambda^*)^T(w - w_0) \\
& + \frac{1}{2}(w - w_0)^T\nabla_{ww}^2\mathcal{L}(w_0, \lambda^*)(w - w_0) + \frac{\rho_L}{6}\|w - w_0\|^3
\end{aligned} \tag{86}$$

Since $w, w_0$ are feasible, we know $\mathcal{L}(w, \lambda^*) = f(w)$ and $\mathcal{L}(w_0, \lambda^*) = f(w_0)$; this gives:

$$f(w) \le f(w_0) + \chi(w_0)^T(w - w_0) + \frac{1}{2}(w - w_0)^T\mathfrak{M}(w_0)(w - w_0) + \frac{\rho_L}{6}\|w - w_0\|^3 \tag{87}$$

Derivative of $\chi(w)$ By taking the derivative of $\chi(w)$ again, we know the change of this tangent gradient can be characterized by:

$$\nabla\chi(w) = \nabla^2 f(w) - \sum_{i=1}^{m}\lambda_i^*(w)\nabla^2 c_i(w) - \sum_{i=1}^{m}\nabla c_i(w)\nabla\lambda_i^*(w)^T \tag{88}$$

Denote

$$\mathfrak{N}(w) = -\sum_{i=1}^{m}\nabla c_i(w)\nabla\lambda_i^*(w)^T \tag{89}$$

We immediately know that $\nabla\chi(w) = \mathfrak{N}(w) + \mathfrak{M}(w)$.

Remark. The additional term $\mathfrak{N}(w)$ is not necessarily even symmetric in general. This is due to the fact that $\chi(w)$ may not be the gradient of any scalar function. However, $\mathfrak{N}(w)$ has an important property: for any vector $v \in \mathbb{R}^d$, $\mathfrak{N}(w)v \in \mathcal{T}^c(w)$.

Finally, for completeness, we state here the first/second-order necessary (or sufficient) conditions for optimality. Please refer to Wright and Nocedal (1999) for the proofs of these theorems.

Theorem 22 (First-Order Necessary Conditions). In the equality-constrained problem Eq.(77), suppose that $w^\dagger$ is a local solution, that the functions $f$ and $c_i$ are continuously differentiable, and that the LICQ holds at $w^\dagger$. Then there is a Lagrange multiplier vector $\lambda^\dagger$ such that:

$$\nabla_w\mathcal{L}(w^\dagger, \lambda^\dagger) = 0 \tag{90}$$

$$c_i(w^\dagger) = 0, \quad \text{for } i = 1, \cdots, m \tag{91}$$

These conditions are also usually referred to as the Karush-Kuhn-Tucker (KKT) conditions.

Theorem 23 (Second-Order Necessary Conditions). In the equality-constrained problem Eq.(77), suppose that $w^\dagger$ is a local solution, and that the LICQ holds at $w^\dagger$. Let $\lambda^\dagger$ be the Lagrange multiplier vector for which the KKT conditions are satisfied. Then:

$$v^T\nabla_{ww}^2\mathcal{L}(w^\dagger, \lambda^\dagger)v \ge 0 \quad \text{for all } v \in \mathcal{T}(w^\dagger) \tag{92}$$

Theorem 24 (Second-Order Sufficient Conditions). In the equality-constrained problem Eq.(77), suppose that for some feasible point $w^\dagger \in \mathbb{R}^d$ there is a Lagrange multiplier vector $\lambda^\dagger$ for which the KKT conditions are satisfied. Suppose also that:

$$v^T\nabla_{ww}^2\mathcal{L}(w^\dagger, \lambda^\dagger)v > 0 \quad \text{for all } v \in \mathcal{T}(w^\dagger), v \ne 0 \tag{93}$$

Then $w^\dagger$ is a strict local solution.

Remark. By definition Eq.(82), we know immediately that $\lambda^*(w^\dagger)$ is one of the valid Lagrange multipliers $\lambda^\dagger$ for which the KKT conditions are satisfied. This means $\chi(w^\dagger) = \nabla_w\mathcal{L}(w^\dagger, \lambda^\dagger)$ and $\mathfrak{M}(w^\dagger) = \nabla_{ww}^2\mathcal{L}(w^\dagger, \lambda^\dagger)$.

Therefore, Theorems 22, 23 and 24 strongly suggest that $\chi(w)$ and $\mathfrak{M}(w)$ are the right things to look at, which are in some sense equivalent to $\nabla f(w)$ and $\nabla^2 f(w)$ in the unconstrained case.

B.2 Geometrical Lemmas Regarding Constraint Manifold

In the equality-constrained problem, at each step of PSGD we are effectively considering the local manifold around the feasible point $w_{t-1}$. In this section, we provide some technical lemmas relating to the geometry of the constraint manifold in preparation for the proof of the main theorem in the equality-constrained case.

We first show if two points are close, then the projection in the normal space is much smaller than the 
projection in the tangent space. 


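Lemma 25. Suppose the constraints $\{c_i\}_{i=1}^m$ are $\beta_i$-smooth, and $\alpha_c$-RLICQ holds for all $w \in \mathcal{W}$. Then, let
$\sum_{i=1}^m \beta_i^2 / \alpha_c^2 = 1/R^2$; for any $w, w_0 \in \mathcal{W}$, let $\mathcal{T}_0 = \mathcal{T}(w_0)$, then

$$ \|P_{\mathcal{T}_0^c}(w - w_0)\| \le \frac{1}{2R} \|w - w_0\|^2 \tag{94} $$

Furthermore, if $\|w - w_0\| < R$ holds, we additionally have:

$$ \|P_{\mathcal{T}_0^c}(w - w_0)\| \le \frac{\|P_{\mathcal{T}_0}(w - w_0)\|^2}{R} \tag{95} $$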


Proof. First, since for any vector $\hat{v} \in \mathcal{T}_0$ we have $\|C(w_0)^T \hat{v}\| = 0$, by simple linear algebra it is easy
to show:

$$ \|C(w_0)^T (w - w_0)\|^2 = \|C(w_0)^T P_{\mathcal{T}_0^c}(w - w_0)\|^2 \ge \sigma_{\min}^2(C(w_0)) \|P_{\mathcal{T}_0^c}(w - w_0)\|^2 \ge \alpha_c^2 \|P_{\mathcal{T}_0^c}(w - w_0)\|^2 \tag{96} $$

On the other hand, by $\beta_i$-smoothness, we have:

$$ |c_i(w) - c_i(w_0) - \nabla c_i(w_0)^T (w - w_0)| \le \frac{\beta_i}{2} \|w - w_0\|^2 \tag{97} $$

Since $w, w_0$ are feasible points, we have $c_i(w) = c_i(w_0) = 0$, which gives:

$$ \|C(w_0)^T (w - w_0)\|^2 = \sum_{i=1}^m (\nabla c_i(w_0)^T (w - w_0))^2 \le \sum_{i=1}^m \frac{\beta_i^2}{4} \|w - w_0\|^4 \tag{98} $$

Combining Eq. (96) and Eq. (98), and the definition of $R$, we have:

$$ \|P_{\mathcal{T}_0^c}(w - w_0)\|^2 \le \frac{1}{4R^2} \|w - w_0\|^4 = \frac{1}{4R^2} \left( \|P_{\mathcal{T}_0^c}(w - w_0)\|^2 + \|P_{\mathcal{T}_0}(w - w_0)\|^2 \right)^2 \tag{99} $$

Solving this quadratic inequality gives two cases:

$$ \|P_{\mathcal{T}_0^c}(w - w_0)\| \le \frac{\|P_{\mathcal{T}_0}(w - w_0)\|^2}{R} \quad \text{or} \quad \|P_{\mathcal{T}_0^c}(w - w_0)\| \ge R \tag{100} $$

By assumption, we know $\|w - w_0\| < R$ (so the second case cannot hold), which finishes the proof. □
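The step from Eq. (99) to Eq. (100) can be spelled out as follows. This is a short worked derivation added for readability; the shorthand $a$ and $b$ is introduced here only for illustration and does not appear in the original argument.

```latex
% Shorthand (illustrative): a = ||P_{T_0^c}(w - w_0)||,  b = ||P_{T_0}(w - w_0)||^2.
% Eq. (99), after taking square roots, reads
\[
  a \le \frac{a^2 + b}{2R}
  \quad\Longleftrightarrow\quad
  a^2 - 2Ra + b \ge 0 .
\]
% Since a^2 + b = ||w - w_0||^2 < R^2, we have b < R^2, so the quadratic in a has
% two real roots R \pm \sqrt{R^2 - b}, and the inequality holds exactly when
\[
  a \le R - \sqrt{R^2 - b}
  \quad\text{or}\quad
  a \ge R + \sqrt{R^2 - b} .
\]
% The first branch simplifies by rationalizing:
\[
  R - \sqrt{R^2 - b} = \frac{b}{R + \sqrt{R^2 - b}} \le \frac{b}{R},
\]
% and the second branch implies a \ge R, which gives the two cases of Eq. (100).
```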

Here, we see that $\sqrt{\sum_{i=1}^m \beta_i^2 / \alpha_c^2} = 1/R$ serves as an upper bound on the curvature of the constraint manifold
and, equivalently, $R$ serves as a lower bound on the radius of curvature. $\alpha_c$-RLICQ and smoothness
guarantee that the curvature is bounded.
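As a sanity check of Lemma 25, the sketch below evaluates both sides of Eq. (94) on the unit sphere, assuming the single constraint $c(w) = \|w\|^2 - 1$ (so $\beta = 2$, $\alpha_c = 2$ and hence $R = 1$). The function names are hypothetical and the example is an illustration, not part of the original proof. On the sphere the bound happens to hold with equality, so the slack is zero up to rounding.

```python
import math
import random

def unit(vec):
    """Scale a nonzero vector to unit Euclidean norm."""
    n = math.sqrt(sum(t * t for t in vec))
    return [t / n for t in vec]

def lemma25_slack(w, w0, R=1.0):
    """Return ||w - w0||^2 / (2R) - ||P_normal(w - w0)|| for points on the unit sphere."""
    diff = [a - b for a, b in zip(w, w0)]
    # The normal space at w0 is spanned by w0, so the normal component is <w0, diff> w0.
    normal_norm = abs(sum(a * b for a, b in zip(w0, diff)))
    return sum(t * t for t in diff) / (2.0 * R) - normal_norm

def min_slack(d=5, trials=1000, seed=0):
    """Smallest slack over random pairs of feasible points (should be >= 0 up to rounding)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        w0 = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
        w = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
        worst = min(worst, lemma25_slack(w, w0))
    return worst
```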

Next we show that the normal/tangent spaces of nearby points are close.

Lemma 26. Suppose the constraints $\{c_i\}_{i=1}^m$ are $\beta_i$-smooth, and $\alpha_c$-RLICQ holds for all $w \in \mathcal{W}$. Let
$\sum_{i=1}^m \beta_i^2 / \alpha_c^2 = 1/R^2$; for any $w, w_0 \in \mathcal{W}$, let $\mathcal{T}_0 = \mathcal{T}(w_0)$. Then for all $\hat{v} \in \mathcal{T}(w)$ such that $\|\hat{v}\| = 1$, we have

$$ \|P_{\mathcal{T}_0^c} \cdot \hat{v}\| \le \frac{\|w - w_0\|}{R} \tag{101} $$

Proof. With a calculation similar to Eq. (96), we immediately have:

$$ \|P_{\mathcal{T}_0^c} \cdot \hat{v}\|^2 \le \frac{\|C(w_0)^T \hat{v}\|^2}{\sigma_{\min}^2(C(w_0))} \le \frac{\|C(w_0)^T \hat{v}\|^2}{\alpha_c^2} \tag{102} $$

Since $\hat{v} \in \mathcal{T}(w)$, we have $C(w)^T \hat{v} = 0$; combined with the fact that $\hat{v}$ is a unit vector, we have:

$$ \begin{aligned} \|C(w_0)^T \hat{v}\|^2 &= \|[C(w_0) - C(w)]^T \hat{v}\|^2 = \sum_{i=1}^m ([\nabla c_i(w_0) - \nabla c_i(w)]^T \hat{v})^2 \\ &\le \sum_{i=1}^m \|\nabla c_i(w_0) - \nabla c_i(w)\|^2 \|\hat{v}\|^2 \le \sum_{i=1}^m \beta_i^2 \|w_0 - w\|^2 \end{aligned} \tag{103} $$

Combining Eq. (102) and Eq. (103), and the definition of $R$, we conclude the proof. □

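Lemma 27. Suppose the constraints $\{c_i\}_{i=1}^m$ are $\beta_i$-smooth, and $\alpha_c$-RLICQ holds for all $w \in \mathcal{W}$. Let
$\sum_{i=1}^m \beta_i^2 / \alpha_c^2 = 1/R^2$; for any $w, w_0 \in \mathcal{W}$, let $\mathcal{T}_0 = \mathcal{T}(w_0)$. Then for all $\hat{v} \in \mathcal{T}^c(w)$ such that $\|\hat{v}\| = 1$, we have

$$ \|P_{\mathcal{T}_0} \cdot \hat{v}\| \le \frac{\|w - w_0\|}{R} \tag{104} $$

Proof. By definition of projection, clearly, we have $P_{\mathcal{T}_0} \cdot \hat{v} + P_{\mathcal{T}_0^c} \cdot \hat{v} = \hat{v}$. Since $\hat{v} \in \mathcal{T}^c(w)$, without loss
of generality, assume $\hat{v} = \sum_{i=1}^m \lambda_i \nabla c_i(w)$. Define $d = \sum_{i=1}^m \lambda_i \nabla c_i(w_0)$; clearly $d \in \mathcal{T}_0^c$. Since projection
gives the closest point in the subspace, we have:

$$ \begin{aligned} \|P_{\mathcal{T}_0} \cdot \hat{v}\| &= \|P_{\mathcal{T}_0^c} \cdot \hat{v} - \hat{v}\| \le \|d - \hat{v}\| \\ &\le \sum_{i=1}^m |\lambda_i| \, \|\nabla c_i(w_0) - \nabla c_i(w)\| \le \sum_{i=1}^m |\lambda_i| \beta_i \|w_0 - w\| \end{aligned} \tag{105} $$

On the other hand, let $\lambda = (\lambda_1, \dots, \lambda_m)^T$; we know $C(w)\lambda = \hat{v}$, thus:

$$ \lambda = C(w)^\dagger \hat{v} = (C(w)^T C(w))^{-1} C(w)^T \hat{v} \tag{106} $$

Therefore, by $\alpha_c$-RLICQ and the fact that $\hat{v}$ is a unit vector, we know $\|\lambda\| \le 1/\alpha_c$. Combined with Eq. (105) and the
Cauchy-Schwarz inequality $\sum_i |\lambda_i| \beta_i \le \|\lambda\| \sqrt{\sum_i \beta_i^2}$, this finishes the proof. □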

Using the previous lemmas, we can then prove the following: starting from any point $w_0$ on the constraint manifold,
the result of adding any small vector $v$ and then projecting back to the feasible set is not very different from the
result of adding $P_{\mathcal{T}(w_0)} v$.


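Lemma 28. Suppose the constraints $\{c_i\}_{i=1}^m$ are $\beta_i$-smooth, and $\alpha_c$-RLICQ holds for all $w \in \mathcal{W}$. Let
$\sum_{i=1}^m \beta_i^2 / \alpha_c^2 = 1/R^2$; for any $w_0 \in \mathcal{W}$, let $\mathcal{T}_0 = \mathcal{T}(w_0)$. Then let $w_1 = w_0 + \eta \hat{v}$ and $w_2 = w_0 + \eta P_{\mathcal{T}_0} \cdot \hat{v}$, where
$\hat{v} \in \mathbb{S}^{d-1}$ is a unit vector. Then, we have:

$$ \|\Pi_{\mathcal{W}}(w_1) - w_2\| \le \frac{4\eta^2}{R} \tag{107} $$

where the projection $\Pi_{\mathcal{W}}(w)$ is defined as the closest point to $w$ on the feasible set $\mathcal{W}$.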

Proof. First, note that $\|w_1 - w_0\| = \eta$, and by definition of projection, there must exist a projection $\Pi_{\mathcal{W}}(w_1)$
inside the ball $\mathbb{B}_\eta(w_1) = \{w \mid \|w - w_1\| \le \eta\}$.

Denote $u_1 = \Pi_{\mathcal{W}}(w_1)$; clearly $u_1 \in \mathcal{W}$. We can formulate $u_1$ as the solution to the following
constrained optimization problem:

$$ \min_u \|w_1 - u\|^2 \quad \text{s.t. } c_i(u) = 0,\ i = 1, \dots, m \tag{108} $$

Since the function $\|w_1 - u\|^2$ and the $c_i(u)$ are continuously differentiable by assumption, and since
$\alpha_c$-RLICQ holding for all $w \in \mathcal{W}$ implies that LICQ holds at $u_1$, by the Karush-Kuhn-Tucker
necessary conditions we immediately know $(w_1 - u_1) \in \mathcal{T}^c(u_1)$.

Since $u_1 \in \mathbb{B}_\eta(w_1)$, we know $\|w_0 - u_1\| \le 2\eta$; by Lemma 27, we immediately have:

$$ \|P_{\mathcal{T}_0}(w_1 - u_1)\| = \frac{\|P_{\mathcal{T}_0}(w_1 - u_1)\|}{\|w_1 - u_1\|} \|w_1 - u_1\| \le \frac{1}{R} \|w_0 - u_1\| \cdot \|w_1 - u_1\| \le \frac{2}{R} \eta^2 \tag{109} $$

Let $v_1 = w_0 + P_{\mathcal{T}_0}(u_1 - w_0)$; we have:

$$ \begin{aligned} \|v_1 - w_2\| &= \|(v_1 - w_0) - (w_2 - w_0)\| = \|P_{\mathcal{T}_0}(u_1 - w_0) - P_{\mathcal{T}_0}(w_1 - w_0)\| \\ &= \|P_{\mathcal{T}_0}(w_1 - u_1)\| \le \frac{2}{R} \eta^2 \end{aligned} \tag{110} $$

On the other hand, by Lemma 25, we have:

$$ \|u_1 - v_1\| = \|P_{\mathcal{T}_0^c}(u_1 - w_0)\| \le \frac{1}{2R} \|u_1 - w_0\|^2 \le \frac{2}{R} \eta^2 \tag{111} $$

Combining Eq. (110) and Eq. (111), we finish the proof. □
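Lemma 28 can be checked numerically in the simplest setting. The sketch below assumes the unit sphere as the feasible set (constraint $c(w) = \|w\|^2 - 1$, for which $R = 1$ and the projection is plain normalization) and reports the largest observed value of $\|\Pi_{\mathcal{W}}(w_1) - w_2\| / \eta^2$, which the lemma bounds by $4/R$. The helper names are hypothetical and the example is only illustrative.

```python
import math
import random

def unit(vec):
    """Scale a nonzero vector to unit Euclidean norm."""
    n = math.sqrt(sum(t * t for t in vec))
    return [t / n for t in vec]

def lemma28_gap(w0, v, eta):
    """Return ||Pi_W(w1) - w2|| for the unit sphere W = {w : ||w|| = 1}."""
    dot = sum(a * b for a, b in zip(w0, v))
    # Tangent part of v at w0: remove the component along the normal direction w0.
    v_tan = [b - dot * a for a, b in zip(w0, v)]
    w1 = [a + eta * b for a, b in zip(w0, v)]
    w2 = [a + eta * b for a, b in zip(w0, v_tan)]
    proj = unit(w1)  # closest point on the unit sphere to w1 (w1 is nonzero for eta < 1)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(proj, w2)))

def worst_ratio(d=6, eta=0.05, trials=2000, seed=0):
    """Largest observed gap divided by eta^2 over random base points and directions."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        w0 = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
        v = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
        worst = max(worst, lemma28_gap(w0, v, eta) / eta ** 2)
    return worst
```

Because the gap scales as $\eta^2$, a projected step and a tangent-space step agree to first order in $\eta$, which is the fact that the equivalent PSGD formulation relies on.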


B.3 Main Theorem 

Now we are ready to prove the main theorems. First we revise the definition of strict saddle in the constrained 
case. 

Definition 29. A twice differentiable function $f(w)$ with constraints $c_i(w)$ is $(\alpha, \gamma, \epsilon, \delta)$-strict saddle if for
any point $w$ one of the following is true:

1. $\|\chi(w)\| \ge \epsilon$.

2. $\hat{v}^T \mathfrak{M}(w) \hat{v} \le -\gamma$ for some $\hat{v} \in \mathcal{T}(w)$, $\|\hat{v}\| = 1$.

3. There is a local minimum $w^*$ such that $\|w - w^*\| \le \delta$, and for all $w'$ in the $2\delta$ neighborhood of $w^*$,
we have $\hat{v}^T \mathfrak{M}(w') \hat{v} \ge \alpha$ for all $\hat{v} \in \mathcal{T}(w')$, $\|\hat{v}\| = 1$.

Next, we prove an equivalent formulation for PSGD.

Lemma 30. Suppose the constraints $\{c_i\}_{i=1}^m$ are $\beta_i$-smooth, and $\alpha_c$-RLICQ holds for all $w \in \mathcal{W}$. Furthermore,
if the function $f$ is $L$-Lipschitz and the noise $\xi$ is bounded, then running PSGD as in Eq. (78) is equivalent
to running:

$$ w_t = w_{t-1} - \eta \cdot (\chi(w_{t-1}) + P_{\mathcal{T}(w_{t-1})} \xi_{t-1}) + \iota_{t-1} \tag{112} $$

where $\iota$ is the correction for projection, and $\|\iota\| \le \tilde{O}(\eta^2)$.

Proof. Lemma 30 is a direct corollary of Lemma 28. □

The intuition behind this lemma is as follows: when the $\{c_i\}_{i=1}^m$ are smooth and $\alpha_c$-RLICQ holds for all $w \in \mathcal{W}$,
the constraint manifold has bounded curvature everywhere. Then, if we only care about first-order
behavior, the update is well approximated by the local dynamics in the tangent plane, up to a second-order correction.

Therefore, by Eq. (112), we see that locally it is not much different from the unconstrained case Eq. (15), up to
a negligible correction. In the following analysis, we will always use Eq. (112) as the update
equation for PSGD.

Since most of the following proof is very similar to the unconstrained case, we only point out the
essential steps.

Theorem 31 (Main Theorem for Equality-Constrained Case). Suppose a function $f(w) : \mathbb{R}^d \to \mathbb{R}$ with
constraints $c_i(w) : \mathbb{R}^d \to \mathbb{R}$ is $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, and has a stochastic gradient oracle with radius at
most $Q$, also satisfying $\mathbb{E}\xi = 0$ and $\mathbb{E}\xi\xi^T = \sigma^2 I$. Further, suppose the function $f$ is $B$-bounded,
$L$-Lipschitz, $\beta$-smooth, and has $\rho$-Lipschitz Hessian, and the constraints $\{c_i\}_{i=1}^m$ are $L_i$-Lipschitz, $\beta_i$-smooth,
and have $\rho_i$-Lipschitz Hessians. Then there exists a threshold $\eta_{\max} = \tilde{\Theta}(1)$, so that for any $\zeta > 0$, and for
any $\eta \le \eta_{\max} / \max\{1, \log(1/\zeta)\}$, with probability at least $1 - \zeta$, in $t = \tilde{O}(\eta^{-2} \log(1/\zeta))$ iterations PSGD
outputs a point $w_t$ that is $\tilde{O}(\sqrt{\eta \log(1/\eta\zeta)})$-close to some local minimum $w^*$.

First, we prove that the assumptions in the main theorem imply the smoothness conditions for $\mathfrak{M}(w)$, $\mathfrak{N}(w)$
and $\nabla^2_{ww} \mathcal{L}(w, \lambda^*(w'))$.

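Lemma 32. Under the assumptions of Theorem 31, there exist $\beta_M, \beta_N, \rho_M, \rho_N, \rho_L$ polynomially related to
$B, L, \beta, \rho, 1/\alpha_c$ and $\{L_i, \beta_i, \rho_i\}_{i=1}^m$ so that:

1. $\|\mathfrak{M}(w)\| \le \beta_M$ and $\|\mathfrak{N}(w)\| \le \beta_N$ for all $w \in \mathcal{W}$.

2. $\mathfrak{M}(w)$ is $\rho_M$-Lipschitz, $\mathfrak{N}(w)$ is $\rho_N$-Lipschitz, and $\nabla^2_{ww} \mathcal{L}(w, \lambda^*(w'))$ is $\rho_L$-Lipschitz for all
$w' \in \mathcal{W}$.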

Proof. By definition of $\mathfrak{M}(w)$, $\mathfrak{N}(w)$ and $\nabla^2_{ww} \mathcal{L}(w, \lambda^*(w'))$, the above conditions hold if there exist
$B_\lambda, L_\lambda, \beta_\lambda$ bounded by $\tilde{O}(1)$, so that $\lambda^*(w)$ is $B_\lambda$-bounded, $L_\lambda$-Lipschitz, and $\beta_\lambda$-smooth.

By definition Eq. (82), we have:

$$ \lambda^*(w) = C(w)^\dagger \nabla f(w) = (C(w)^T C(w))^{-1} C(w)^T \nabla f(w) \tag{113} $$

Because $f$ is $B$-bounded, $L$-Lipschitz, $\beta$-smooth, and its Hessian is $\rho$-Lipschitz, we eventually only
need to prove that there exist $B_c, L_c, \beta_c$ bounded by $\tilde{O}(1)$, so that the pseudo-inverse $C(w)^\dagger$ is $B_c$-bounded,
$L_c$-Lipschitz, and $\beta_c$-smooth.

Since $\alpha_c$-RLICQ holds for all feasible points, we immediately have $\|C(w)^\dagger\| \le 1/\alpha_c$, which is thus bounded. For
simplicity, in the following we write $C^\dagger$ for $C(w)^\dagger$ when there is no ambiguity. By some linear algebra,
the derivative of the pseudo-inverse is:

$$ \frac{\partial C(w)^\dagger}{\partial w_i} = -C^\dagger \frac{\partial C(w)}{\partial w_i} C^\dagger + C^\dagger [C^\dagger]^T \frac{\partial C(w)^T}{\partial w_i} (I - C C^\dagger) \tag{114} $$

Again, $\alpha_c$-RLICQ implies that the derivative of the pseudo-inverse is well-defined at every feasible point.
Let the tensors $E(w)$, $\tilde{E}(w)$ be the derivatives of $C(w)$, $C^\dagger(w)$, defined as:

$$ [E(w)]_{ijk} = \frac{\partial [C(w)]_{ik}}{\partial w_j}, \qquad [\tilde{E}(w)]_{ijk} = \frac{\partial [C(w)^\dagger]_{ik}}{\partial w_j} \tag{115} $$

Define the transpose of a 3rd-order tensor by $E^T_{ijk} = E_{kji}$; then we have

$$ \tilde{E}(w) = -[E(w)](C^\dagger, I, C^\dagger) + [E(w)^T](C^\dagger [C^\dagger]^T, I, (I - C C^\dagger)) \tag{116} $$

where by calculation $[E(w)](I, I, e_i) = \nabla^2 c_i(w)$.

Finally, since $C(w)^\dagger$ and $\nabla^2 c_i(w)$ are bounded by $\tilde{O}(1)$, by Eq. (116) we know $\tilde{E}(w)$ is bounded, that
is, $C(w)^\dagger$ is Lipschitz. Again, since both $C(w)^\dagger$ and $\nabla^2 c_i(w)$ are bounded and Lipschitz, by Eq. (116) we know
$\tilde{E}(w)$ is also $\tilde{O}(1)$-Lipschitz. This finishes the proof. □


From now on, we can use the same proof strategy as in the unconstrained case. Below we list the corresponding
lemmas and the essential steps that require modifications.

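Lemma 33. Under the assumptions of Theorem 31, with notations in Lemma 32, for any point with $\|\chi(w_0)\| \ge
\sqrt{2\eta\sigma^2\beta_M(d - m)}$ where $\sqrt{2\eta\sigma^2\beta_M(d - m)} < \epsilon$, after one iteration we have:

$$ \mathbb{E} f(w_1) - f(w_0) \le -\tilde{\Omega}(\eta^2) \tag{117} $$

Proof. Choose $\eta_{\max} < 1/\beta_M$, and also small enough; then by the update equation Eq. (112), we have:

$$ \begin{aligned} \mathbb{E} f(w_1) - f(w_0) &\le \chi(w_0)^T \mathbb{E}(w_1 - w_0) + \frac{\beta_M}{2} \mathbb{E}\|w_1 - w_0\|^2 \\ &\le -\left(\eta - \frac{\beta_M \eta^2}{2}\right) \|\chi(w_0)\|^2 + \frac{\eta^2\sigma^2\beta_M(d - m)}{2} + \tilde{O}(\eta^2)\|\chi(w_0)\| + \tilde{O}(\eta^3) \\ &\le -\left(\eta - \tilde{O}(\eta^{1.5}) - \frac{\beta_M \eta^2}{2}\right) \|\chi(w_0)\|^2 + \frac{\eta^2\sigma^2\beta_M(d - m)}{2} + \tilde{O}(\eta^3) \\ &\le -\frac{\eta^2\sigma^2\beta_M d}{4} \end{aligned} \tag{118} $$

which finishes the proof. □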


Theorem 34. Under the assumptions of Theorem 31, with notations in Lemma 32, for any initial point $w_0$
that is $\tilde{O}(\sqrt{\eta}) < \delta$ close to a local minimum $w^*$, with probability at least $1 - \zeta/2$, the following holds
simultaneously:

$$ \forall t \le \tilde{O}\left(\frac{1}{\eta^2} \log\frac{1}{\zeta}\right), \quad \|w_t - w^*\| \le \tilde{O}\left(\sqrt{\eta \log\frac{1}{\eta\zeta}}\right) < \delta \tag{119} $$

where $w^*$ is the locally optimal point.


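Proof. By calculus, we know

$$ \chi(w_t) = \chi(w^*) + \int_0^1 (\mathfrak{M} + \mathfrak{N})(w^* + t(w_t - w^*)) \, dt \cdot (w_t - w^*) \tag{120} $$

Let filtration $\mathcal{F}_t = \sigma\{\xi_0, \dots, \xi_{t-1}\}$, and note $\sigma\{\Delta_0, \dots, \Delta_t\} \subset \mathcal{F}_t$, where $\sigma\{\cdot\}$ denotes the sigma field.
Let event $\mathfrak{E}_t = \{\forall \tau \le t, \|w_\tau - w^*\| \le \mu \sqrt{\eta \log\frac{1}{\eta\zeta}} < \delta\}$, where $\mu$ is independent of $(\eta, \zeta)$ and will be
specified later.

By Definition 29 of $(\alpha, \gamma, \epsilon, \delta)$-strict saddle, we know $\mathfrak{M}(w)$ is locally $\alpha$-strongly convex restricted to
its tangent space $\mathcal{T}(w)$ in the $2\delta$-neighborhood of $w^*$. If $\eta_{\max}$ is chosen small enough, by Remark B.1 and
Lemma 25 we have in addition:

$$ \begin{aligned} \chi(w_t)^T (w_t - w^*) 1_{\mathfrak{E}_t} &= (w_t - w^*)^T \int_0^1 (\mathfrak{M} + \mathfrak{N})(w^* + t(w_t - w^*)) \, dt \cdot (w_t - w^*) 1_{\mathfrak{E}_t} \\ &\ge [\alpha \|w_t - w^*\|^2 - \tilde{O}(\|w_t - w^*\|^3)] 1_{\mathfrak{E}_t} \ge 0.5 \alpha \|w_t - w^*\|^2 1_{\mathfrak{E}_t} \end{aligned} \tag{121} $$

Then, everything else follows almost the same as the proof of Lemma 15. □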

Lemma 35. Under the assumptions of Theorem 31, with notations in Lemma 32, for any initial point $w_0$
where $\|\chi(w_0)\| \le \tilde{O}(\eta) < \epsilon$ and $\hat{v}^T \mathfrak{M}(w_0) \hat{v} \le -\gamma$ for some $\hat{v} \in \mathcal{T}(w_0)$, $\|\hat{v}\| = 1$, there is a number
of steps $T$ that depends on $w_0$ such that:

$$ \mathbb{E} f(w_T) - f(w_0) \le -\tilde{\Omega}(\eta) \tag{122} $$

The number of steps $T$ has a fixed upper bound $T_{\max}$ that is independent of $w_0$, where $T \le T_{\max} =
O((\log(d - m))/\gamma\eta)$.

Similar to the unconstrained case, we show this by a coupling sequence. Here the sequence we construct
only walks on the tangent space; by the lemmas in the previous subsection, we know it is not very far from
the actual sequence. We first define and characterize the coupled sequence in the following lemma:

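Lemma 36. Under the assumptions of Theorem 31, with notations in Lemma 32, let $\tilde{f}$ be defined as the local
second-order approximation of $f(w)$ around $w_0$ in the tangent space $\mathcal{T}_0 = \mathcal{T}(w_0)$:

$$ \tilde{f}(w) = f(w_0) + \chi(w_0)^T (w - w_0) + \frac{1}{2} (w - w_0)^T [P_{\mathcal{T}_0}^T \mathfrak{M}(w_0) P_{\mathcal{T}_0}] (w - w_0) \tag{123} $$

Let $\{\tilde{w}_t\}$ be the corresponding sequence generated by running SGD on the function $\tilde{f}$, with $\tilde{w}_0 = w_0$ and noise
projected to $\mathcal{T}_0$ (i.e. $\tilde{w}_t = \tilde{w}_{t-1} - \eta(\tilde{\chi}(\tilde{w}_{t-1}) + P_{\mathcal{T}_0} \xi_{t-1})$). For simplicity, denote $\tilde{\chi}(w) = \nabla \tilde{f}(w)$ and
$\tilde{\mathfrak{M}} = P_{\mathcal{T}_0}^T \mathfrak{M}(w_0) P_{\mathcal{T}_0}$; then we have analytically:

$$ \tilde{\chi}(\tilde{w}_t) = (1 - \eta\tilde{\mathfrak{M}})^t \tilde{\chi}(\tilde{w}_0) - \eta\tilde{\mathfrak{M}} \sum_{\tau=0}^{t-1} (1 - \eta\tilde{\mathfrak{M}})^{t-\tau-1} P_{\mathcal{T}_0} \xi_\tau \tag{124} $$

$$ \tilde{w}_t - w_0 = -\eta \sum_{\tau=0}^{t-1} (1 - \eta\tilde{\mathfrak{M}})^\tau \tilde{\chi}(\tilde{w}_0) - \eta \sum_{\tau=0}^{t-1} (1 - \eta\tilde{\mathfrak{M}})^{t-\tau-1} P_{\mathcal{T}_0} \xi_\tau \tag{125} $$

Furthermore, for any initial point $w_0$ where $\|\chi(w_0)\| \le \tilde{O}(\eta) < \epsilon$ and $\min_{\hat{v} \in \mathcal{T}(w_0), \|\hat{v}\| = 1} \hat{v}^T \mathfrak{M}(w_0) \hat{v} =
-\gamma_0$, there exists a $T \in \mathbb{N}$ satisfying:

$$ \frac{d - m}{\eta\gamma_0} \le \sum_{\tau=0}^{T-1} (1 + \eta\gamma_0)^{2\tau} < \frac{3(d - m)}{\eta\gamma_0} \tag{126} $$

With probability at least $1 - \tilde{O}(\eta^3)$, the following holds simultaneously for all $t \le T$:

$$ \|\tilde{w}_t - w_0\| \le \tilde{O}\left(\eta^{\frac{1}{2}} \log\frac{1}{\eta}\right); \qquad \|\tilde{\chi}(\tilde{w}_t)\| \le \tilde{O}\left(\eta^{\frac{1}{2}} \log\frac{1}{\eta}\right) \tag{127} $$

Proof. Clearly we have:

$$ \tilde{\chi}(\tilde{w}_t) = \tilde{\chi}(\tilde{w}_{t-1}) + \tilde{\mathfrak{M}} (\tilde{w}_t - \tilde{w}_{t-1}) \tag{128} $$

and

$$ \tilde{w}_t = \tilde{w}_{t-1} - \eta (\tilde{\chi}(\tilde{w}_{t-1}) + P_{\mathcal{T}_0} \xi_{t-1}) \tag{129} $$

This lemma is then proved by a direct application of Lemma 17. □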
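The closed forms in Lemma 36 for the coupled sequence follow from unrolling a linear recursion, and this can be checked directly. The sketch below assumes a one-dimensional tangent space, so that the projected Hessian is a scalar `m` (negative near a saddle) and `g0` plays the role of the initial tangent gradient, with $w_0 = 0$. All names are hypothetical, and the check is an illustration rather than part of the proof.

```python
import random

def iterate(g0, m, eta, noise):
    """Run w_t = w_{t-1} - eta * (chi(w_{t-1}) + xi_{t-1}) with chi(w) = g0 + m * w and w_0 = 0."""
    w = 0.0
    for xi in noise:
        w -= eta * (g0 + m * w + xi)
    return w, g0 + m * w

def closed_form(g0, m, eta, noise):
    """Evaluate the unrolled expressions for w_t - w_0 and chi(w_t) after t = len(noise) steps."""
    t = len(noise)
    q = 1.0 - eta * m
    drift = sum(q ** tau for tau in range(t)) * g0
    kicks = sum(q ** (t - tau - 1) * noise[tau] for tau in range(t))
    w = -eta * drift - eta * kicks
    chi = q ** t * g0 - eta * m * kicks
    return w, chi

def max_discrepancy(steps=40, seed=0):
    """Largest absolute difference between the iteration and the closed form."""
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(steps)]
    w_a, chi_a = iterate(0.01, -0.5, 0.1, noise)
    w_b, chi_b = closed_form(0.01, -0.5, 0.1, noise)
    return max(abs(w_a - w_b), abs(chi_a - chi_b))
```

With a negative `m`, the factor `1 - eta * m` exceeds 1, so early noise is amplified geometrically; this is the mechanism that pushes the iterate away from a saddle point.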


Then we show the sequence constructed is very close to the actual sequence. 

Lemma 37. Under the assumptions of Theorem 31, with notations in Lemma 32, let $\{w_t\}$ be the corresponding
sequence generated by running PSGD on the function $f$. Also let $\tilde{f}$ and $\{\tilde{w}_t\}$ be defined as in Lemma
36. Then, for any initial point $w_0$ where $\|\chi(w_0)\|^2 \le \tilde{O}(\eta) < \epsilon$ and $\min_{\hat{v} \in \mathcal{T}(w_0), \|\hat{v}\| = 1} \hat{v}^T \mathfrak{M}(w_0) \hat{v} =
-\gamma_0$, given the choice of $T$ as in Eq. (126), with probability at least $1 - \tilde{O}(\eta^2)$, the following holds
simultaneously for all $t \le T$:

$$ \|w_t - \tilde{w}_t\| \le \tilde{O}\left(\eta \log^2\frac{1}{\eta}\right) \tag{130} $$

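Proof. First, we have the following update of the tangent gradient:

$$ \begin{aligned} \chi(w_t) &= \chi(w_{t-1}) + \int_0^1 \nabla\chi(w_{t-1} + t(w_t - w_{t-1})) \, dt \cdot (w_t - w_{t-1}) \\ &= \chi(w_{t-1}) + \mathfrak{M}(w_{t-1})(w_t - w_{t-1}) + \mathfrak{N}(w_{t-1})(w_t - w_{t-1}) + \theta_{t-1} \end{aligned} \tag{131} $$

where the remainder is:

$$ \theta_{t-1} = \int_0^1 [\nabla\chi(w_{t-1} + t(w_t - w_{t-1})) - \nabla\chi(w_{t-1})] \, dt \cdot (w_t - w_{t-1}) \tag{132} $$

Project it to the tangent space $\mathcal{T}_0 = \mathcal{T}(w_0)$. Denote $\tilde{\mathfrak{M}} = P_{\mathcal{T}_0}^T \mathfrak{M}(w_0) P_{\mathcal{T}_0}$ and $\tilde{\mathfrak{M}}'_{t-1} = P_{\mathcal{T}_0}^T [\mathfrak{M}(w_{t-1}) -
\mathfrak{M}(w_0)] P_{\mathcal{T}_0}$. Then, we have:

$$ \begin{aligned} P_{\mathcal{T}_0} \cdot \chi(w_t) &= P_{\mathcal{T}_0} \cdot \chi(w_{t-1}) + P_{\mathcal{T}_0} (\mathfrak{M}(w_{t-1}) + \mathfrak{N}(w_{t-1}))(w_t - w_{t-1}) + P_{\mathcal{T}_0} \theta_{t-1} \\ &= P_{\mathcal{T}_0} \cdot \chi(w_{t-1}) + P_{\mathcal{T}_0} \mathfrak{M}(w_{t-1}) P_{\mathcal{T}_0} (w_t - w_{t-1}) \\ &\quad + P_{\mathcal{T}_0} \mathfrak{M}(w_{t-1}) P_{\mathcal{T}_0^c} (w_t - w_{t-1}) + P_{\mathcal{T}_0} \mathfrak{N}(w_{t-1})(w_t - w_{t-1}) + P_{\mathcal{T}_0} \theta_{t-1} \\ &= P_{\mathcal{T}_0} \cdot \chi(w_{t-1}) + \tilde{\mathfrak{M}} (w_t - w_{t-1}) + \phi_{t-1} \end{aligned} \tag{133} $$

where

$$ \phi_{t-1} = [\tilde{\mathfrak{M}}'_{t-1} + P_{\mathcal{T}_0} \mathfrak{M}(w_{t-1}) P_{\mathcal{T}_0^c} + P_{\mathcal{T}_0} \mathfrak{N}(w_{t-1})] (w_t - w_{t-1}) + P_{\mathcal{T}_0} \theta_{t-1} \tag{134} $$

By Hessian smoothness, we immediately have:

$$ \|\tilde{\mathfrak{M}}'_{t-1}\| \le \|\mathfrak{M}(w_{t-1}) - \mathfrak{M}(w_0)\| \le \rho_M \|w_{t-1} - w_0\| \le \rho_M (\|w_{t-1} - \tilde{w}_{t-1}\| + \|\tilde{w}_{t-1} - w_0\|) \tag{135} $$

$$ \|\theta_{t-1}\| \le \frac{\rho_M + \rho_N}{2} \|w_t - w_{t-1}\|^2 \tag{136} $$

Substituting the update equation of PSGD (Eq. (112)) into Eq. (133), we have:

$$ \begin{aligned} P_{\mathcal{T}_0} \cdot \chi(w_t) &= P_{\mathcal{T}_0} \cdot \chi(w_{t-1}) - \eta\tilde{\mathfrak{M}} (P_{\mathcal{T}_0} \cdot \chi(w_{t-1}) + P_{\mathcal{T}_0} \cdot P_{\mathcal{T}(w_{t-1})} \xi_{t-1}) + \tilde{\mathfrak{M}} \cdot \iota_{t-1} + \phi_{t-1} \\ &= (1 - \eta\tilde{\mathfrak{M}}) P_{\mathcal{T}_0} \cdot \chi(w_{t-1}) - \eta\tilde{\mathfrak{M}} P_{\mathcal{T}_0} \xi_{t-1} + \eta\tilde{\mathfrak{M}} P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_{t-1})} \xi_{t-1} + \tilde{\mathfrak{M}} \cdot \iota_{t-1} + \phi_{t-1} \end{aligned} \tag{137} $$

Let $\Delta_t = P_{\mathcal{T}_0} \cdot \chi(w_t) - \tilde{\chi}(\tilde{w}_t)$ denote the difference of the tangent gradients in $\mathcal{T}(w_0)$; then from Eq. (128),
Eq. (129), and Eq. (137) we have:

$$ \Delta_t = (1 - \eta\tilde{\mathfrak{M}}) \Delta_{t-1} + \eta\tilde{\mathfrak{M}} P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_{t-1})} \xi_{t-1} + \tilde{\mathfrak{M}} \cdot \iota_{t-1} + \phi_{t-1} \tag{138} $$

$$ P_{\mathcal{T}_0} \cdot (w_t - w_0) - (\tilde{w}_t - w_0) = -\eta \sum_{\tau=0}^{t-1} \Delta_\tau + \eta \sum_{\tau=0}^{t-1} P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_\tau)} \xi_\tau + \sum_{\tau=0}^{t-1} P_{\mathcal{T}_0} \iota_\tau \tag{139} $$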

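By Lemma 25, we know that if $\sum_{i=1}^m \beta_i^2 / \alpha_c^2 = 1/R^2$, then we have:

$$ \|P_{\mathcal{T}_0^c}(w_t - w_0)\| \le \frac{\|w_t - w_0\|^2}{2R} \tag{140} $$

Let filtration $\mathcal{F}_t = \sigma\{\xi_0, \dots, \xi_{t-1}\}$, and note $\sigma\{\Delta_0, \dots, \Delta_t\} \subset \mathcal{F}_t$, where $\sigma\{\cdot\}$ denotes the sigma
field. Also, let event $\mathfrak{K}_t = \{\forall \tau \le t, \|\tilde{\chi}(\tilde{w}_\tau)\| \le \tilde{O}(\eta^{\frac{1}{2}} \log\frac{1}{\eta}), \|\tilde{w}_\tau - w_0\| \le \tilde{O}(\eta^{\frac{1}{2}} \log\frac{1}{\eta})\}$, and denote
$\Gamma_t = \eta \sum_{\tau=0}^{t-1} P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_\tau)} \xi_\tau$; let $\mathfrak{E}_t = \{\forall \tau \le t, \|\Delta_\tau\| \le \mu_1 \eta \log^2\frac{1}{\eta}, \|\Gamma_\tau\| \le \mu_2 \eta \log^2\frac{1}{\eta}, \|w_\tau -
\tilde{w}_\tau\| \le \mu_3 \eta \log^2\frac{1}{\eta}\}$, where $(\mu_1, \mu_2, \mu_3)$ are independent of $(\eta, \zeta)$ and will be determined later. To prevent
ambiguity in the proof, the $\tilde{O}$ notation will not hide any dependence on $\mu$. Clearly the events $\mathfrak{K}_{t-1} \subset \mathcal{F}_{t-1}$ and $\mathfrak{E}_{t-1} \subset
\mathcal{F}_{t-1}$ are thus independent of $\xi_{t-1}$.

Then, conditioned on event $\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}$, by the triangle inequality, we have $\|w_\tau - w_0\| \le \tilde{O}(\eta^{\frac{1}{2}} \log\frac{1}{\eta})$ for
all $\tau \le t - 1 \le T - 1$. We then need to carefully bound each term in Eq. (138). We
know $w_t - w_{t-1} = -\eta \cdot (\chi(w_{t-1}) + P_{\mathcal{T}(w_{t-1})} \xi_{t-1}) + \iota_{t-1}$, and then by Lemma 27 and Lemma 26 we have:

$$ \begin{aligned} \|\eta\tilde{\mathfrak{M}} P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_{t-1})} \xi_{t-1}\| &\le \tilde{O}(\eta^{1.5} \log\tfrac{1}{\eta}) \\ \|\tilde{\mathfrak{M}} \cdot \iota_{t-1}\| &\le \tilde{O}(\eta^2) \\ \|[\tilde{\mathfrak{M}}'_{t-1} + P_{\mathcal{T}_0} \mathfrak{M}(w_{t-1}) P_{\mathcal{T}_0^c} + P_{\mathcal{T}_0} \mathfrak{N}(w_{t-1})](-\eta \cdot \chi(w_{t-1}))\| &\le \tilde{O}(\eta^2 \log^2\tfrac{1}{\eta}) \\ \|[\tilde{\mathfrak{M}}'_{t-1} + P_{\mathcal{T}_0} \mathfrak{M}(w_{t-1}) P_{\mathcal{T}_0^c} + P_{\mathcal{T}_0} \mathfrak{N}(w_{t-1})](-\eta P_{\mathcal{T}(w_{t-1})} \xi_{t-1})\| &\le \tilde{O}(\eta^{1.5} \log\tfrac{1}{\eta}) \\ \|[\tilde{\mathfrak{M}}'_{t-1} + P_{\mathcal{T}_0} \mathfrak{M}(w_{t-1}) P_{\mathcal{T}_0^c} + P_{\mathcal{T}_0} \mathfrak{N}(w_{t-1})] \iota_{t-1}\| &\le \tilde{O}(\eta^2) \\ \|P_{\mathcal{T}_0} \theta_{t-1}\| &\le \tilde{O}(\eta^2) \end{aligned} \tag{141} $$

Therefore, abstractly, conditioned on event $\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}$, we can write the recursive equation as:

$$ \Delta_t = (1 - \eta\tilde{\mathfrak{M}}) \Delta_{t-1} + A + B \tag{142} $$

where $\|A\| \le \tilde{O}(\eta^{1.5} \log\frac{1}{\eta})$ and $\|B\| \le \tilde{O}(\eta^2 \log^2\frac{1}{\eta})$; in addition, by independence, it is easy to check that we
also have $\mathbb{E}[(1 - \eta\tilde{\mathfrak{M}}) \Delta_{t-1} A \mid \mathcal{F}_{t-1}] = 0$. This is exactly the same case as in the proof of Lemma 18. By the
same martingale argument and the Azuma-Hoeffding inequality, and by choosing $\mu_1$ large enough, we can prove

$$ P\left( \mathfrak{E}_{t-1} \cap \left\{ \|\Delta_t\| \ge \mu_1 \eta \log^2\frac{1}{\eta} \right\} \right) \le \tilde{O}(\eta^3) \tag{143} $$

On the other hand, for $\Gamma_t = \eta \sum_{\tau=0}^{t-1} P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_\tau)} \xi_\tau$, we have:

$$ \begin{aligned} \mathbb{E}[\Gamma_t 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \mid \mathcal{F}_{t-1}] &= \left[ \Gamma_{t-1} + \eta \mathbb{E}[P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_{t-1})} \xi_{t-1} \mid \mathcal{F}_{t-1}] \right] 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \\ &= \Gamma_{t-1} 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \le \Gamma_{t-1} 1_{\mathfrak{K}_{t-2} \cap \mathfrak{E}_{t-2}} \end{aligned} \tag{144} $$

Therefore, we have $\mathbb{E}[\Gamma_t 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \mid \mathcal{F}_{t-1}] \le \Gamma_{t-1} 1_{\mathfrak{K}_{t-2} \cap \mathfrak{E}_{t-2}}$, which means $\Gamma_t 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}}$ is a supermartingale.

We also know by Lemma 27, with probability 1:

$$ \begin{aligned} |\Gamma_t 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} - \mathbb{E}[\Gamma_t 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \mid \mathcal{F}_{t-1}]| &= |\eta P_{\mathcal{T}_0} \cdot P_{\mathcal{T}^c(w_{t-1})} \xi_{t-1}| \cdot 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \\ &\le \tilde{O}(\eta) \|w_{t-1} - w_0\| 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} \le \tilde{O}(\eta^{1.5} \log\tfrac{1}{\eta}) = c_{t-1} \end{aligned} \tag{145} $$

By the Azuma-Hoeffding inequality, with probability less than $\tilde{O}(\eta^3)$, for $t \le T \le O(\log(d - m)/\gamma_0\eta)$:

$$ \Gamma_t 1_{\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}} - \Gamma_0 \cdot 1 > \tilde{O}(1) \sqrt{\sum_{\tau=0}^{t-1} c_\tau^2} \, \log\left(\frac{1}{\eta}\right) = \tilde{O}\left(\eta \log^2\frac{1}{\eta}\right) \tag{146} $$

This means there exists some $\tilde{C}_2 = \tilde{O}(1)$ so that:

$$ P\left( \mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1} \cap \left\{ \|\Gamma_t\| \ge \tilde{C}_2 \eta \log^2\frac{1}{\eta} \right\} \right) \le \tilde{O}(\eta^3) \tag{147} $$

By choosing $\mu_2 > \tilde{C}_2$, we have:

$$ P\left( \mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1} \cap \left\{ \|\Gamma_t\| \ge \mu_2 \eta \log^2\frac{1}{\eta} \right\} \right) \le \tilde{O}(\eta^3) \tag{148} $$

Therefore, combined with Lemma 36, we have:

$$ P\left( \mathfrak{E}_{t-1} \cap \left\{ \|\Gamma_t\| \ge \mu_2 \eta \log^2\frac{1}{\eta} \right\} \right) \le \tilde{O}(\eta^3) + P(\overline{\mathfrak{K}_{t-1}}) \le \tilde{O}(\eta^3) \tag{149} $$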


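Finally, conditioned on event $\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}$, if we have $\|\Gamma_t\| \le \mu_2 \eta \log^2\frac{1}{\eta}$, then by Eq. (139):

$$ \|P_{\mathcal{T}_0} \cdot (w_t - w_0) - (\tilde{w}_t - w_0)\| \le \tilde{O}\left( (\mu_1 + \mu_2) \eta \log^2\frac{1}{\eta} \right) \tag{150} $$

Since $\|w_{t-1} - w_0\| \le \tilde{O}(\eta^{\frac{1}{2}} \log\frac{1}{\eta})$ and $\|w_t - w_{t-1}\| \le \tilde{O}(\eta)$, by Eq. (140):

$$ \|P_{\mathcal{T}_0^c}(w_t - w_0)\| \le \frac{\|w_t - w_0\|^2}{2R} \le \tilde{O}\left(\eta \log^2\frac{1}{\eta}\right) \tag{151} $$

Thus:

$$ \begin{aligned} \|w_t - \tilde{w}_t\|^2 &= \|P_{\mathcal{T}_0} \cdot (w_t - \tilde{w}_t)\|^2 + \|P_{\mathcal{T}_0^c} \cdot (w_t - \tilde{w}_t)\|^2 \\ &= \|P_{\mathcal{T}_0} \cdot (w_t - w_0) - (\tilde{w}_t - w_0)\|^2 + \|P_{\mathcal{T}_0^c}(w_t - w_0)\|^2 \le \tilde{O}\left( (\mu_1 + \mu_2)^2 \eta^2 \log^4\frac{1}{\eta} \right) \end{aligned} \tag{152} $$

That is, there exists some $\tilde{C}_3 = \tilde{O}(1)$ so that $\|w_t - \tilde{w}_t\| \le \tilde{C}_3 (\mu_1 + \mu_2) \eta \log^2\frac{1}{\eta}$. Therefore, conditioned on
event $\mathfrak{K}_{t-1} \cap \mathfrak{E}_{t-1}$, we have proved that if we choose $\mu_3 > \tilde{C}_3 (\mu_1 + \mu_2)$, then the event $\{\|w_t - \tilde{w}_t\| \ge \mu_3 \eta \log^2\frac{1}{\eta}\} \subset
\{\|\Gamma_t\| \ge \mu_2 \eta \log^2\frac{1}{\eta}\}$. Then, combining this fact with Eq. (143) and Eq. (149), we have proved:

$$ P(\mathfrak{E}_{t-1} \cap \overline{\mathfrak{E}_t}) \le \tilde{O}(\eta^3) \tag{153} $$

Because $P(\overline{\mathfrak{E}_0}) = 0$ and $T \le \tilde{O}(\frac{1}{\eta})$, we have $P(\overline{\mathfrak{E}_T}) \le \tilde{O}(\eta^2)$, which concludes the proof. □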

These two lemmas allow us to prove the result when the initial point is very close to a saddle point. 

Proof of Lemma 35. Combining the Taylor expansion Eq. (87) with Lemma 36 and Lemma 37, we prove this lemma
by the same argument as in the proof of Lemma 16. □

Finally the main theorem follows.

Proof of Theorem 31. By Lemma 33, Lemma 35 and Theorem 34, with the same argument as in the proof of
Theorem 13, we easily conclude the proof. □

C Detailed Proofs for Section 4

In this section we show that the two optimization problems (11) and (13) satisfy the $(\alpha, \gamma, \epsilon, \delta)$-strict saddle property.


C.1 Warm up: maximum eigenvalue formulation

Recall that we are trying to solve the optimization problem (11), which we restate here.

$$ \max\ T(u, u, u, u), \quad \|u\|^2 = 1. \tag{154} $$

Here the tensor $T$ has the orthogonal decomposition $T = \sum_{i=1}^d a_i^{\otimes 4}$. We first do a change of coordinates to
work in the coordinate system specified by the $a_i$'s (this does not change the dynamics of the algorithm). In
particular, let $u = \sum_{i=1}^d x_i a_i$ (where $x \in \mathbb{R}^d$); then we can see $T(u, u, u, u) = \sum_{i=1}^d x_i^4$. Therefore, letting
$f(x) = -\|x\|_4^4$, the optimization problem is equivalent to

$$ \min\ f(x) \quad \text{s.t. } \|x\|_2^2 = 1 \tag{155} $$

This is a constrained optimization problem, so we apply the framework developed in Section 3.3.

Let $c(x) = \|x\|_2^2 - 1$. We first compute the Lagrangian

$$ \mathcal{L}(x, \lambda) = f(x) - \lambda c(x) = -\|x\|_4^4 - \lambda(\|x\|_2^2 - 1). \tag{156} $$

Since there is only one constraint, and its gradient always has norm 2 when $\|x\| = 1$, we know the set
of constraints satisfies 2-RLICQ. In particular, we can compute the correct value of the Lagrange multiplier $\lambda$,

$$ \lambda^*(x) = \arg\min_\lambda \|\nabla_x \mathcal{L}(x, \lambda)\| = \arg\min_\lambda \sum_{i=1}^d (2x_i^3 + \lambda x_i)^2 = -2\|x\|_4^4 \tag{157} $$

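Therefore, the gradient in the tangent space is equal to

$$ \begin{aligned} \chi(x) &= \nabla_x \mathcal{L}(x, \lambda)|_{(x, \lambda^*(x))} = \nabla f(x) - \lambda^*(x) \nabla c(x) \\ &= -4(x_1^3, \dots, x_d^3)^T - 2\lambda^*(x)(x_1, \dots, x_d)^T \\ &= 4\left( (\|x\|_4^4 - x_1^2) x_1, \dots, (\|x\|_4^4 - x_d^2) x_d \right)^T \end{aligned} \tag{158} $$

The second-order partial derivative of the Lagrangian is equal to

$$ \begin{aligned} \mathfrak{M}(x) &= \nabla^2_{xx} \mathcal{L}(x, \lambda)|_{(x, \lambda^*(x))} = \nabla^2 f(x) - \lambda^*(x) \nabla^2 c(x) \\ &= -12\,\mathrm{diag}(x_1^2, \dots, x_d^2) - 2\lambda^*(x) I_d \\ &= -12\,\mathrm{diag}(x_1^2, \dots, x_d^2) + 4\|x\|_4^4 I_d \end{aligned} \tag{159} $$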
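The formulas for the tangent gradient and the Hessian of the Lagrangian are easy to evaluate at concrete points. The sketch below is an illustration with hypothetical function names; it uses $\chi(x)_i = 4(\|x\|_4^4 - x_i^2)x_i$, which is what substituting $\lambda^*(x) = -2\|x\|_4^4$ into $\nabla f(x) - \lambda^*(x)\nabla c(x)$ gives, and the diagonal of $\mathfrak{M}(x)$ from Eq. (159). It looks at the local minimum $e_1$ and at the saddle point $(1/\sqrt{2}, 1/\sqrt{2}, 0)$.

```python
import math

def l4_norm_4(x):
    """Return ||x||_4^4."""
    return sum(t ** 4 for t in x)

def tangent_gradient(x):
    """chi(x) for f(x) = -||x||_4^4 on the unit sphere."""
    s = l4_norm_4(x)
    return [4.0 * (s - t * t) * t for t in x]

def lagrangian_hessian_diag(x):
    """Diagonal of M(x) = -12 diag(x_i^2) + 4 ||x||_4^4 I (M(x) is diagonal)."""
    s = l4_norm_4(x)
    return [-12.0 * t * t + 4.0 * s for t in x]

def quadratic_form(diag, v):
    """v^T M v for a diagonal matrix M given by its diagonal."""
    return sum(m * t * t for m, t in zip(diag, v))

r = 1.0 / math.sqrt(2.0)
saddle = [r, r, 0.0]
escape_direction = [r, -r, 0.0]  # unit vector orthogonal to the saddle point, hence tangent
curvature_at_saddle = quadratic_form(lagrangian_hessian_diag(saddle), escape_direction)
```

At $e_1$ the tangent gradient vanishes and the diagonal is $(-8, 4, 4)$, so the curvature is positive on the tangent space spanned by $e_2, e_3$. At the saddle point the tangent gradient also vanishes, but the tangent direction $(1/\sqrt{2}, -1/\sqrt{2}, 0)$ has curvature $-4$.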

Since the variable x has bounded norm, and the function is a polynomial, it’s clear that the function itself 
is bounded and all its derivatives are bounded. Moreover, all the derivatives of the constraint are bounded. 
We summarize this in the following lemma. 

Lemma 38. The objective function (11) is bounded by 1, and its $p$-th order derivative is bounded by $O(\sqrt{d})$ for
$p = 1, 2, 3$. The constraint's $p$-th order derivative is bounded by 2, for $p = 1, 2, 3$.

Therefore the function satisfies all the smoothness conditions we need. Finally we show that the gradient and
Hessian of the Lagrangian satisfy the $(\alpha, \gamma, \epsilon, \delta)$-strict saddle property. Note that we did not try to optimize the
dependency with respect to $d$.

Theorem 39. The only local minima of optimization problem (11) are $\pm a_i$ ($i \in [d]$). Further, it satisfies
$(\alpha, \gamma, \epsilon, \delta)$-strict saddle for $\gamma = 7/d$, $\alpha = 3$ and $\epsilon, \delta = 1/\mathrm{poly}(d)$.

In order to prove this theorem, we consider the transformed version Eq. (155). We first need the following two
lemmas for points around a saddle point and a local minimum respectively. We choose

$$ \epsilon_0 = (10d)^{-4}, \quad \epsilon = 4\epsilon_0^2, \quad \delta = 2d\epsilon_0, \quad \mathfrak{S}(x) = \{i \mid |x_i| > \epsilon_0\} \tag{160} $$

where, intuitively, $\mathfrak{S}(x)$ is the set of coordinates whose values are relatively large.

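Lemma 40. Under the choice of parameters in Eq. (160), suppose $\|\chi(x)\| \le \epsilon$ and $|\mathfrak{S}(x)| \ge 2$. Then there
exists $\hat{v} \in \mathcal{T}(x)$ with $\|\hat{v}\| = 1$, so that $\hat{v}^T \mathfrak{M}(x) \hat{v} \le -7/d$.

Proof. Suppose $|\mathfrak{S}(x)| = p$, with $2 \le p \le d$. Since $\|\chi(x)\| \le \epsilon = 4\epsilon_0^2$, by Eq. (158) we have for each
$i \in [d]$, $|[\chi(x)]_i| = 4|(x_i^2 - \|x\|_4^4) x_i| \le 4\epsilon_0^2$. Therefore, we have:

$$ \forall i \in \mathfrak{S}(x), \quad |x_i^2 - \|x\|_4^4| \le \epsilon_0 \tag{161} $$

and thus:

$$ \begin{aligned} \left| \|x\|_4^4 - \frac{1}{p} \right| &= \left| \|x\|_4^4 - \frac{1}{p} \sum_i x_i^2 \right| \\ &\le \frac{1}{p} \sum_{i \in \mathfrak{S}(x)} \left| \|x\|_4^4 - x_i^2 \right| + \frac{1}{p} \sum_{i \in [d] - \mathfrak{S}(x)} x_i^2 \le \epsilon_0 + \frac{d - p}{p} \epsilon_0^2 \le 2\epsilon_0 \end{aligned} \tag{162} $$

Combined with Eq. (161), this means:

$$ \forall i \in \mathfrak{S}(x), \quad \left| x_i^2 - \frac{1}{p} \right| \le 3\epsilon_0 \tag{163} $$

Because of symmetry, WLOG we assume $\mathfrak{S}(x) = \{1, \dots, p\}$. Since $|\mathfrak{S}(x)| \ge 2$, we can pick $\hat{v} =
(a, b, 0, \dots, 0)$. Here $a > 0$, $b < 0$, and $a^2 + b^2 = 1$. We pick $a$ such that $ax_1 + bx_2 = 0$. The solution is
the intersection of a circle of radius 1 and a line through $(0, 0)$, which always exists. For this $\hat{v}$, we know
$\|\hat{v}\| = 1$ and $\hat{v}^T x = 0$, thus $\hat{v} \in \mathcal{T}(x)$. We have:

$$ \begin{aligned} \hat{v}^T \mathfrak{M}(x) \hat{v} &= (-12x_1^2 + 4\|x\|_4^4) a^2 + (-12x_2^2 + 4\|x\|_4^4) b^2 \\ &= -8x_1^2 a^2 - 8x_2^2 b^2 - 4(x_1^2 - \|x\|_4^4) a^2 - 4(x_2^2 - \|x\|_4^4) b^2 \\ &\le -\frac{8}{p} + 24\epsilon_0 + 4\epsilon_0 \le -7/d \end{aligned} \tag{164} $$

which finishes the proof. □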

Lemma 41. Under the choice of parameters in Eq. (160), suppose $\|\chi(x)\| \le \epsilon$ and $|\mathfrak{S}(x)| = 1$. Then
there is a local minimum $x^*$ such that $\|x - x^*\| \le \delta$, and for all $x'$ in the $2\delta$ neighborhood of $x^*$, we have
$\hat{v}^T \mathfrak{M}(x') \hat{v} \ge 3$ for all $\hat{v} \in \mathcal{T}(x')$, $\|\hat{v}\| = 1$.

Proof. WLOG, we assume $\mathfrak{S}(x) = \{1\}$. Then we immediately have $|x_i| \le \epsilon_0$ for all $i > 1$, and thus:

$$ 1 \ge x_1^2 = 1 - \sum_{i > 1} x_i^2 \ge 1 - d\epsilon_0^2 \tag{165} $$

Therefore $x_1 \ge \sqrt{1 - d\epsilon_0^2}$ or $x_1 \le -\sqrt{1 - d\epsilon_0^2}$, which means $x_1$ is either close to 1 or close to $-1$. By
symmetry, WLOG we can assume $x_1 \ge \sqrt{1 - d\epsilon_0^2}$. Let $e_1 = (1, 0, \dots, 0)$; then we
know:

$$ \|x - e_1\|^2 \le (x_1 - 1)^2 + \sum_{i > 1} x_i^2 \le 2d\epsilon_0^2 \le \delta^2 \tag{166} $$

Next, we show $e_1$ is a local minimum. According to Eq. (159), $\mathfrak{M}(e_1)$ is a diagonal matrix with 4
on the diagonal except for the first diagonal entry (which is equal to $-8$). Since $\mathcal{T}(e_1) = \mathrm{span}\{e_2, \dots, e_d\}$,
we have:

$$ v^T \mathfrak{M}(e_1) v \ge 4\|v\|^2 > 0 \quad \text{for all } v \in \mathcal{T}(e_1),\ v \neq 0 \tag{167} $$

which by Theorem 24 means $e_1$ is a local minimum.

Finally, let $\mathcal{T}_1 = \mathcal{T}(e_1)$ be the tangent space of the constraint manifold at $e_1$. We know that for all $x'$ in the
$2\delta$ neighborhood of $e_1$, and for all $\hat{v} \in \mathcal{T}(x')$, $\|\hat{v}\| = 1$:

$$ \begin{aligned} \hat{v}^T \mathfrak{M}(x') \hat{v} &\ge \hat{v}^T \mathfrak{M}(e_1) \hat{v} - |\hat{v}^T \mathfrak{M}(e_1) \hat{v} - \hat{v}^T \mathfrak{M}(x') \hat{v}| \\ &\ge 4\|P_{\mathcal{T}_1} \hat{v}\|^2 - 8\|P_{\mathcal{T}_1^c} \hat{v}\|^2 - \|\mathfrak{M}(e_1) - \mathfrak{M}(x')\| \|\hat{v}\|^2 \\ &= 4 - 12\|P_{\mathcal{T}_1^c} \hat{v}\|^2 - \|\mathfrak{M}(e_1) - \mathfrak{M}(x')\| \end{aligned} \tag{168} $$

By Lemma 26, we know $\|P_{\mathcal{T}_1^c} \hat{v}\|^2 \le \|x' - e_1\|^2 \le 4\delta^2$. By Eq. (159), we have:

$$ \begin{aligned} \|\mathfrak{M}(e_1) - \mathfrak{M}(x')\| &\le \|\mathfrak{M}(e_1) - \mathfrak{M}(x')\|_F \le \sum_{(i,j)} |[\mathfrak{M}(e_1)]_{ij} - [\mathfrak{M}(x')]_{ij}| \\ &\le \sum_i \left| -12[e_1]_i^2 + 4\|e_1\|_4^4 + 12x_i'^2 - 4\|x'\|_4^4 \right| \le 64d\delta \end{aligned} \tag{169} $$

In conclusion, we have $\hat{v}^T \mathfrak{M}(x') \hat{v} \ge 4 - 48\delta^2 - 64d\delta \ge 3$, which finishes the proof. □

Finally, we are ready to prove Theorem 39.

Proof of Theorem 39. According to Lemma 40 and Lemma 41, we immediately know the optimization
problem satisfies $(\alpha, \gamma, \epsilon, \delta)$-strict saddle.

The only thing that remains to be shown is that the only local minima of optimization problem (11) are $\pm a_i$ ($i \in
[d]$). This is equivalent to showing that the only local minima of the transformed problem are $\pm e_i$ ($i \in [d]$),
where $e_i = (0, \dots, 0, 1, 0, \dots, 0)$, with the 1 in the $i$-th coordinate.

By inspecting the proofs of Lemma 40 and Lemma 41, we know these two lemmas actually hold for
any small enough choice of $\epsilon_0$ satisfying $\epsilon_0 \le (10d)^{-4}$. By pushing $\epsilon_0 \to 0$, we know that any point satisfying
$\|\chi(x)\| \le \epsilon \to 0$, if it is close to some local minimum, must satisfy $1 = |\mathfrak{S}(x)| \to |\mathrm{supp}(x)|$. Therefore,
we know the only possible local minima are $\pm e_i$ ($i \in [d]$). In Lemma 41 we proved that $e_1$ is a local minimum;
by symmetry, this finishes the proof. □
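To make the statement of Theorem 39 concrete, the following sketch runs projected stochastic gradient descent on the transformed problem $\min -\|x\|_4^4$ subject to $\|x\|_2 = 1$, starting exactly at the saddle point $(1/\sqrt{2}, 1/\sqrt{2}, 0, 0)$. It is an illustrative simulation under assumptions that are not taken from the paper: the noise is uniform on $[-0.1, 0.1]^d$ (bounded and zero-mean), the step size is fixed at 0.05, and the function name is hypothetical.

```python
import math
import random

def psgd_on_sphere(x0, eta=0.05, noise=0.1, steps=2000, seed=0):
    """Projected SGD for f(x) = -sum(x_i^4) on the unit sphere."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Stochastic gradient: exact gradient -4 x_i^3 plus bounded zero-mean noise.
        g = [-4.0 * t ** 3 + rng.uniform(-noise, noise) for t in x]
        y = [t - eta * gi for t, gi in zip(x, g)]
        # Projection onto the sphere is normalization (y is nonzero for this step size).
        norm = math.sqrt(sum(t * t for t in y))
        x = [t / norm for t in y]
    return x

r = 1.0 / math.sqrt(2.0)
final_point = psgd_on_sphere([r, r, 0.0, 0.0])
```

Without noise the iterate would stay at the saddle point forever, because the tangent gradient vanishes there. With noise, the component along the negative-curvature direction is amplified at every step, and the iterate ends up close to one of the local minima $\pm e_i$.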


C.2 New formulation 

In this section we consider our new formulation (13). We first restate the optimization problem here:

$$ \min \sum_{i \neq j} T(u^{(i)}, u^{(i)}, u^{(j)}, u^{(j)}), \quad \forall i\ \|u^{(i)}\|^2 = 1. \tag{170} $$

Note that we changed the notation for the variables from $u_i$ to $u^{(i)}$, because in later proofs we will often
refer to particular coordinates of these vectors.

Similar to the previous section, we perform a change of basis. The effect is equivalent to making the $a_i$'s
equal to the basis vectors $e_i$ (and hence the tensor is equal to $T = \sum_{i=1}^d e_i^{\otimes 4}$). After the transformation the
problem becomes

$$ \min \sum_{(i,j): i \neq j} h(u^{(i)}, u^{(j)}) \quad \text{s.t. } \|u^{(i)}\|^2 = 1 \quad \forall i \in [d] \tag{171} $$

Here $h(u^{(i)}, u^{(j)}) = \sum_{k=1}^d (u_k^{(i)} u_k^{(j)})^2$, $(i, j) \in [d]^2$. We divide the objective function by 2 to simplify the
calculation.

Let $U \in \mathbb{R}^{d^2}$ be the concatenation of $\{u^{(i)}\}$ such that $U_{ij} = u_j^{(i)}$. Let $c_i(U) = \|u^{(i)}\|^2 - 1$ and
$f(U) = \frac{1}{2} \sum_{(i,j): i \neq j} h(u^{(i)}, u^{(j)})$. We can then compute the Lagrangian

$$ \mathcal{L}(U, \lambda) = f(U) - \sum_{i=1}^d \lambda_i c_i(U) = \frac{1}{2} \sum_{(i,j): i \neq j} h(u^{(i)}, u^{(j)}) - \sum_{i=1}^d \lambda_i (\|u^{(i)}\|^2 - 1) \tag{172} $$

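The gradients of the $c_i(U)$'s are equal to $(0, \dots, 0, 2u^{(i)}, 0, \dots, 0)^T$; all of these vectors are orthogonal to
each other (because they have disjoint supports) and have norm 2. Therefore the set of constraints satisfies
2-RLICQ. We can then compute the Lagrange multipliers $\lambda^*$ as follows

$$ \lambda^*(U) = \arg\min_\lambda \|\nabla_U \mathcal{L}(U, \lambda)\| = \arg\min_\lambda 4 \sum_i \sum_k \Big( \sum_{j: j \neq i} U_{jk}^2 U_{ik} - \lambda_i U_{ik} \Big)^2 \tag{173} $$

which gives:

$$ \lambda_i^*(U) = \arg\min_{\lambda_i} \sum_k \Big( \sum_{j: j \neq i} U_{jk}^2 U_{ik} - \lambda_i U_{ik} \Big)^2 = \sum_{j: j \neq i} h(u^{(j)}, u^{(i)}) \tag{174} $$

Therefore, the gradient in the tangent space is equal to

$$ \chi(U) = \nabla_U \mathcal{L}(U, \lambda)|_{(U, \lambda^*(U))} = \nabla f(U) - \sum_{i=1}^d \lambda_i^*(U) \nabla c_i(U). \tag{175} $$

The gradient is a $d^2$-dimensional vector (which can be viewed as a $d \times d$ matrix corresponding to the entries
of $U$), and we express it coordinate by coordinate. For simplicity of later proofs, denote:

$$ \psi_{ik}(U) = \sum_{j: j \neq i} \left[ U_{jk}^2 - h(u^{(j)}, u^{(i)}) \right] = \sum_{j: j \neq i} \Big[ U_{jk}^2 - \sum_{l=1}^d U_{il}^2 U_{jl}^2 \Big] \tag{176} $$

Then we have:

$$ \begin{aligned} [\chi(U)]_{ik} &= 2 \Big( \sum_{j: j \neq i} U_{jk}^2 - \lambda_i^*(U) \Big) U_{ik} \\ &= 2 U_{ik} \sum_{j: j \neq i} \left( U_{jk}^2 - h(u^{(j)}, u^{(i)}) \right) \\ &= 2 U_{ik} \psi_{ik}(U) \end{aligned} \tag{177} $$

Similarly we can compute the second-order partial derivative of the Lagrangian as

$$ \mathfrak{M}(U) = \nabla^2 f(U) - \sum_{i=1}^d \lambda_i^* \nabla^2 c_i(U). \tag{178} $$

The Hessian is a $d^2 \times d^2$ matrix; we index it by 4 indices in $[d]$. The entries are summarized below:

$$ \begin{aligned} [\mathfrak{M}(U)]_{ik, i'k'} &= \frac{\partial}{\partial U_{i'k'}} [\nabla_U \mathcal{L}(U, \lambda)]_{ik} \Big|_{(U, \lambda^*(U))} = \frac{\partial}{\partial U_{i'k'}} \Big[ 2 \Big( \sum_{j: j \neq i} U_{jk}^2 - \lambda_i \Big) U_{ik} \Big] \Big|_{(U, \lambda^*(U))} \\ &= \begin{cases} 2 \big( \sum_{j: j \neq i} U_{jk}^2 - \lambda_i^*(U) \big) & \text{if } k = k',\ i = i' \\ 4 U_{i'k} U_{ik} & \text{if } k = k',\ i \neq i' \\ 0 & \text{if } k \neq k' \end{cases} \\ &= \begin{cases} 2 \psi_{ik}(U) & \text{if } k = k',\ i = i' \\ 4 U_{i'k} U_{ik} & \text{if } k = k',\ i \neq i' \\ 0 & \text{if } k \neq k' \end{cases} \end{aligned} \tag{179} $$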

Similar to the previous case, it is easy to bound the function value and derivatives of the function and 
the constraints. 


Lemma 42. The objective function (13) and its $p$-th order derivatives are all bounded by $\mathrm{poly}(d)$ for $p = 1, 2, 3$.
Each constraint's $p$-th order derivative is bounded by 2, for $p = 1, 2, 3$.

Therefore the function satisfies all the smoothness conditions we need. Finally we show that the gradient and
Hessian of the Lagrangian satisfy the $(\alpha, \gamma, \epsilon, \delta)$-strict saddle property. Again we did not try to optimize the
dependency with respect to $d$.

Theorem 43. Optimization problem (13) has exactly $2^d \cdot d!$ local minima that correspond to permutations
and sign flips of the $a_i$'s. Further, it satisfies $(\alpha, \gamma, \epsilon, \delta)$-strict saddle for $\alpha = 1$ and $\gamma, \epsilon, \delta = 1/\mathrm{poly}(d)$.

Again, in order to prove this theorem, we follow the same strategy: we consider the transformed version
Eq. (171) and first prove the following lemmas for points around a saddle point and a local minimum respectively.
We choose

$$ \epsilon_0 = (10d)^{-6}, \quad \epsilon = 2\epsilon_0^6, \quad \delta = 2d\epsilon_0, \quad \gamma = \epsilon_0^4/4, \quad \mathfrak{S}(u) = \{k \mid |u_k| > \epsilon_0\} \tag{180} $$

where, intuitively, $\mathfrak{S}(u)$ is the set of coordinates whose values are relatively large.

Lemma 44. Under the choice of parameters in Eq. (180), suppose $\|\chi(U)\| \le \epsilon$, and there exists $(i, j) \in [d]^2$
so that $\mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)}) \neq \emptyset$. Then there exists $\hat{v} \in \mathcal{T}(U)$ with $\|\hat{v}\| = 1$, so that $\hat{v}^T \mathfrak{M}(U) \hat{v} \le -\gamma$.

Proof. Again, since $\|\chi(U)\| \le \epsilon = 2\epsilon_0^6$, by Eq. (177) we have for each $i \in [d]$, $|[\chi(U)]_{ik}| = 2|U_{ik} \psi_{ik}(U)| \le
2\epsilon_0^6$. Therefore, we have:

$$ \forall k \in \mathfrak{S}(u^{(i)}), \quad |\psi_{ik}(U)| \le \epsilon_0^5 \tag{181} $$

Then, we prove this lemma by dividing it into three cases. Note that in order to prove that there exists
$\hat{v} \in \mathcal{T}(U)$ with $\|\hat{v}\| = 1$ so that $\hat{v}^T \mathfrak{M}(U) \hat{v} \le -\gamma$, it suffices to find a vector $v \in \mathcal{T}(U)$ with $\|v\| \le 1$ so
that $v^T \mathfrak{M}(U) v \le -\gamma$.

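Case 1: $|\mathfrak{S}(u^{(i)})| \ge 2$, $|\mathfrak{S}(u^{(j)})| \ge 2$, and $|\mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)})| \ge 2$.

WLOG, assume $\{1,2\} \subseteq \mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)})$, and choose $v$ to be $v_{i1} = \frac{U_{i2}}{4}$, $v_{i2} = -\frac{U_{i1}}{4}$, $v_{j1} = \frac{U_{j2}}{4}$ and
$v_{j2} = -\frac{U_{j1}}{4}$. All other entries of $v$ are zero. Clearly $v \in \mathcal{T}(U)$ and $\|v\| \le 1$. On the other hand, we know
$\mathfrak{M}(U)$ restricted to these 4 coordinates $(i1, i2, j1, j2)$ is

$$
\begin{pmatrix}
2\psi_{i1}(U) & 0 & 4U_{i1}U_{j1} & 0 \\
0 & 2\psi_{i2}(U) & 0 & 4U_{i2}U_{j2} \\
4U_{i1}U_{j1} & 0 & 2\psi_{j1}(U) & 0 \\
0 & 4U_{i2}U_{j2} & 0 & 2\psi_{j2}(U)
\end{pmatrix}
\tag{182}
$$

By Eq. (181), we know all diagonal entries are at most $2\epsilon_0^5$ in absolute value.

If $U_{i1}U_{j1}U_{i2}U_{j2}$ is negative, we have the quadratic form:

$$
\begin{aligned}
v^T \mathfrak{M}(U) v &= U_{i1}U_{j1}U_{i2}U_{j2} + \frac{1}{8}\left[U_{i2}^2\psi_{i1}(U) + U_{i1}^2\psi_{i2}(U) + U_{j2}^2\psi_{j1}(U) + U_{j1}^2\psi_{j2}(U)\right] \\
&\le -\epsilon_0^4 + \epsilon_0^5 \le -\frac{1}{4}\epsilon_0^4 = -\gamma
\end{aligned}
\tag{183}
$$

If $U_{i1}U_{j1}U_{i2}U_{j2}$ is positive, we just swap the sign of the first two coordinates, $v_{i1} = -\frac{U_{i2}}{4}$, $v_{i2} = \frac{U_{i1}}{4}$, and the
above argument still holds.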
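The quadratic form in Case 1 can be checked mechanically. The sketch below is an illustration under stated assumptions: it reads the test direction as $v = (U_{i2}, -U_{i1}, U_{j2}, -U_{j1})/4$ on the coordinates $(i1, i2, j1, j2)$, takes the $4 \times 4$ block with diagonal $2\psi$ and off-diagonal $4U_{i1}U_{j1}$, $4U_{i2}U_{j2}$, and treats the $\psi$ values as free inputs. The function name is hypothetical.

```python
def case1_quadratic_form(Ui1, Ui2, Uj1, Uj2, psi_i1, psi_i2, psi_j1, psi_j2):
    """Return (direct, closed_form, tangency) for the Case 1 direction.

    direct      : v^T M v computed from the 4x4 block
    closed_form : product term plus one eighth of the weighted psi terms
    tangency    : (u^(i) . v_i, u^(j) . v_j), both zero when v is in the tangent space
    """
    # 4x4 block of the Hessian on coordinates (i1, i2, j1, j2).
    M = [
        [2 * psi_i1, 0, 4 * Ui1 * Uj1, 0],
        [0, 2 * psi_i2, 0, 4 * Ui2 * Uj2],
        [4 * Ui1 * Uj1, 0, 2 * psi_j1, 0],
        [0, 4 * Ui2 * Uj2, 0, 2 * psi_j2],
    ]
    # Assumed test direction (entries scaled by 1/4).
    v = [Ui2 / 4, -Ui1 / 4, Uj2 / 4, -Uj1 / 4]
    direct = sum(v[a] * M[a][b] * v[b] for a in range(4) for b in range(4))
    closed_form = Ui1 * Uj1 * Ui2 * Uj2 + (
        Ui2 ** 2 * psi_i1 + Ui1 ** 2 * psi_i2
        + Uj2 ** 2 * psi_j1 + Uj1 ** 2 * psi_j2
    ) / 8
    tangency = (Ui1 * v[0] + Ui2 * v[1], Uj1 * v[2] + Uj2 * v[3])
    return direct, closed_form, tangency
```

Passing `fractions.Fraction` inputs makes the comparison exact. The two off-diagonal couplings each contribute half of the product term, which is why the scaling by 1/4 leaves it with coefficient 1.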

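Case 2: $|\mathfrak{S}(u^{(i)})| \ge 2$, $|\mathfrak{S}(u^{(j)})| \ge 2$, and $|\mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)})| = 1$.

WLOG, assume $\{1,2\} \subseteq \mathfrak{S}(u^{(i)})$ and $\{1,3\} \subseteq \mathfrak{S}(u^{(j)})$, and choose $v$ to be $v_{i1} = \frac{U_{i2}}{4}$, $v_{i2} = -\frac{U_{i1}}{4}$,
$v_{j1} = \frac{U_{j3}}{4}$ and $v_{j3} = -\frac{U_{j1}}{4}$. All other entries of $v$ are zero. Clearly $v \in \mathcal{T}(U)$ and $\|v\| \le 1$. On the other
hand, we know $\mathfrak{M}(U)$ restricted to these 4 coordinates $(i1, i2, j1, j3)$ is

$$
\begin{pmatrix}
2\psi_{i1}(U) & 0 & 4U_{i1}U_{j1} & 0 \\
0 & 2\psi_{i2}(U) & 0 & 0 \\
4U_{i1}U_{j1} & 0 & 2\psi_{j1}(U) & 0 \\
0 & 0 & 0 & 2\psi_{j3}(U)
\end{pmatrix}
\tag{184}
$$

By Eq. (181), we know all diagonal entries are at most $2\epsilon_0^5$ in absolute value. If $U_{i1}U_{j1}U_{i2}U_{j3}$ is negative, we have the quadratic
form:

$$
\begin{aligned}
v^T \mathfrak{M}(U) v &= \frac{1}{2}U_{i1}U_{j1}U_{i2}U_{j3} + \frac{1}{8}\left[U_{i2}^2\psi_{i1}(U) + U_{i1}^2\psi_{i2}(U) + U_{j3}^2\psi_{j1}(U) + U_{j1}^2\psi_{j3}(U)\right] \\
&\le -\frac{1}{2}\epsilon_0^4 + \epsilon_0^5 \le -\frac{1}{4}\epsilon_0^4 = -\gamma
\end{aligned}
\tag{185}
$$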

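If $U_{i1}U_{j1}U_{i2}U_{j3}$ is positive, we just swap the sign of the first two coordinates, $v_{i1} = -\frac{U_{i2}}{4}$, $v_{i2} = \frac{U_{i1}}{4}$, and the
above argument still holds.

Case 3: Either $|\mathfrak{S}(u^{(i)})| = 1$ or $|\mathfrak{S}(u^{(j)})| = 1$.

WLOG, suppose $|\mathfrak{S}(u^{(i)})| = 1$ and $\{1\} = \mathfrak{S}(u^{(i)})$. We know:

$$|(u^{(i)}_1)^2 - 1| \le (d-1)\epsilon_0^2 \tag{186}$$

On the other hand, since $\mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)}) \neq \emptyset$, we have $\mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)}) = \{1\}$, and thus by Eq. (181):

$$|\psi_{j1}(U)| = \Big|\sum_{i': i' \neq j} U_{i'1}^2 - \sum_{i': i' \neq j} h(u^{(i')}, u^{(j)})\Big| \le \epsilon_0^5 \tag{187}$$

Therefore, we have:

$$\sum_{i': i' \neq j} h(u^{(i')}, u^{(j)}) \ge \sum_{i': i' \neq j} U_{i'1}^2 - \epsilon_0^5 \ge U_{i1}^2 - \epsilon_0^5 \ge 1 - d\epsilon_0^2 \tag{188}$$

and

$$
\begin{aligned}
\sum_{k=1}^{d} \psi_{jk}(U) &= \sum_{i': i' \neq j} \sum_{k=1}^{d} U_{i'k}^2 - d \sum_{i': i' \neq j} h(u^{(i')}, u^{(j)}) \\
&\le d - 1 - d(1 - d\epsilon_0^2) = -1 + d^2\epsilon_0^2
\end{aligned}
\tag{189}
$$

Thus there must exist some $k' \in [d]$ so that $\psi_{jk'}(U) \le -\frac{1}{d} + d\epsilon_0^2$. This means we have a "large"
negative entry on the diagonal of $\mathfrak{M}$. Since $|\psi_{j1}(U)| \le \epsilon_0^5$, we know $k' \neq 1$. WLOG, suppose $k' = 2$. We
have $|\psi_{j2}(U)| > \epsilon_0^5$, thus $|U_{j2}| \le \epsilon_0$ by Eq. (181).

Choose $v$ to be $v_{j1} = \frac{U_{j2}}{\sqrt{2}}$, $v_{j2} = -\frac{U_{j1}}{\sqrt{2}}$. All other entries of $v$ are zero. Clearly $v \in \mathcal{T}(U)$ and $\|v\| \le 1$.
On the other hand, we know $\mathfrak{M}(U)$ restricted to these 2 coordinates is

$$
\begin{pmatrix}
2\psi_{j1}(U) & 0 \\
0 & 2\psi_{j2}(U)
\end{pmatrix}
\tag{190}
$$

We know $|U_{j1}| > \epsilon_0$, $|U_{j2}| \le \epsilon_0$, $|\psi_{j1}(U)| \le \epsilon_0^5$, and $\psi_{j2}(U) \le -\frac{1}{d} + d\epsilon_0^2$. Thus:

$$
\begin{aligned}
v^T \mathfrak{M}(U) v &= \psi_{j1}(U)U_{j2}^2 + \psi_{j2}(U)U_{j1}^2 \\
&\le \epsilon_0^7 - \Big(\frac{1}{d} - d\epsilon_0^2\Big)\epsilon_0^2 \le -\frac{1}{2d}\epsilon_0^2 \le -\gamma
\end{aligned}
\tag{191}
$$

Since by our choice of $v$ we have $\|v\| \le 1$, we can choose $\tilde{v} = v/\|v\|$, and immediately have $\tilde{v} \in \mathcal{T}(U)$,
$\|\tilde{v}\| = 1$, and $\tilde{v}^T \mathfrak{M}(U) \tilde{v} \le -\gamma$. □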

Lemma 45. Under the choice of parameters in Eq. (180), suppose $\|\chi(U)\| \le \epsilon$, and for any $(i,j) \in [d]^2$ with $i \neq j$ we
have $\mathfrak{S}(u^{(i)}) \cap \mathfrak{S}(u^{(j)}) = \emptyset$. Then, there is a local minimum $U^*$ such that $\|U - U^*\| \le \delta$, and for all $U'$ in
the $2\delta$ neighborhood of $U^*$, we have $v^T \mathfrak{M}(U') v \ge 1$ for all $v \in \mathcal{T}(U')$, $\|v\| = 1$.

Proof. WLOG, we assume $\mathfrak{S}(u^{(i)}) = \{i\}$ for $i = 1, \dots, d$. Then, we immediately have:

$$|u^{(i)}_j| \le \epsilon_0, \quad |(u^{(i)}_i)^2 - 1| \le (d-1)\epsilon_0^2, \quad \forall (i,j) \in [d]^2,\ j \neq i \tag{192}$$

Then $u^{(i)}_i \ge \sqrt{1 - d\epsilon_0^2}$ or $u^{(i)}_i \le -\sqrt{1 - d\epsilon_0^2}$, which means $u^{(i)}_i$ is either close to 1 or close to −1. By
symmetry, WLOG we can assume $u^{(i)}_i \ge \sqrt{1 - d\epsilon_0^2}$ for all $i \in [d]$.

Let $V \in \mathbb{R}^{d^2}$ be the concatenation of $\{e_1, e_2, \dots, e_d\}$. Then we have:

$$\|U - V\|^2 = \sum_{i=1}^{d} \|u^{(i)} - e_i\|^2 \le 2d^2\epsilon_0^2 \le \delta^2 \tag{193}$$

Next, we show $V$ is a local minimum. According to Eq. (179), we know $\mathfrak{M}(V)$ is a diagonal matrix with
$d^2$ entries:

$$
[\mathfrak{M}(V)]_{ik,ik} = 2\psi_{ik}(V) = 2\sum_{j: j \neq i}\Big[V_{jk}^2 - \sum_{l=1}^{d} V_{il}^2 V_{jl}^2\Big] =
\begin{cases}
2 & \text{if } i \neq k \\
0 & \text{if } i = k
\end{cases}
\tag{194}
$$

We know the unit vector in the direction that corresponds to $[\mathfrak{M}(V)]_{ii,ii}$ is not in the tangent space $\mathcal{T}(V)$
for all $i \in [d]$. Therefore, for any $v \in \mathcal{T}(V)$, we have

$$v^T \mathfrak{M}(V) v \ge 2\|v\|^2 > 0 \quad \text{for all } v \in \mathcal{T}(V),\ v \neq 0 \tag{195}$$

which by Theorem 24 means $V$ is a local minimum.

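Finally, let $\mathcal{T}_V = \mathcal{T}(V)$ denote the tangent space of the constraint manifold at $V$. We know for all $U'$ in the
$2\delta$ neighborhood of $V$, and for all $v \in \mathcal{T}(U')$, $\|v\| = 1$:

$$
\begin{aligned}
v^T \mathfrak{M}(U') v &\ge v^T \mathfrak{M}(V) v - |v^T \mathfrak{M}(V) v - v^T \mathfrak{M}(U') v| \\
&\ge 2\|P_{\mathcal{T}_V} v\|^2 - \|\mathfrak{M}(V) - \mathfrak{M}(U')\| \|v\|^2 \\
&= 2 - 2\|P_{\mathcal{T}_V^c} v\|^2 - \|\mathfrak{M}(V) - \mathfrak{M}(U')\|
\end{aligned}
\tag{196}
$$

By Lemma 26 we know $\|P_{\mathcal{T}_V^c} v\|^2 \le \|U' - V\|^2 \le 4\delta^2$. By Eq. (179), we have:

$$\|\mathfrak{M}(V) - \mathfrak{M}(U')\| \le \|\mathfrak{M}(V) - \mathfrak{M}(U')\|_F \le \sum_{(i,j,k)} \big|[\mathfrak{M}(V)]_{ik,jk} - [\mathfrak{M}(U')]_{ik,jk}\big| \le 100d^3\delta \tag{197}$$

In conclusion, we have $v^T \mathfrak{M}(U') v \ge 2 - 8\delta^2 - 100d^3\delta \ge 1$, which finishes the proof. □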
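The last inequality, together with the distance bound in Eq. (193), is a matter of plugging in the constants. A minimal sketch of that arithmetic, assuming $\epsilon_0 = (10d)^{-6}$ and $\delta = 2d\epsilon_0$ as read from Eq. (180) (function names are hypothetical):

```python
from fractions import Fraction


def final_margin(d):
    """Exact value of 2 - 8*delta^2 - 100*d^3*delta for the assumed constants."""
    eps0 = Fraction(1, (10 * d) ** 6)
    delta = 2 * d * eps0
    return 2 - 8 * delta ** 2 - 100 * d ** 3 * delta


def distance_bound_holds(d):
    """Check 2*d^2*eps0^2 <= delta^2, the last step of Eq. (193)."""
    eps0 = Fraction(1, (10 * d) ** 6)
    delta = 2 * d * eps0
    return 2 * d ** 2 * eps0 ** 2 <= delta ** 2
```

Since $100d^3\delta = 2 \cdot 10^{-4}/d^2$, the margin stays very close to 2 for every $d \ge 1$, which reflects that the constants were not optimized.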

Finally, we are ready to prove Theorem 43.

Proof of Theorem 43. Similarly, the $(\alpha, \gamma, \epsilon, \delta)$-strict saddle property immediately follows from Lemma 44 and Lemma 45.

The only thing that remains to show is that Optimization problem (13) has exactly $2^d \cdot d!$ local minima, which
correspond to permutations and sign flips of the $a_i$'s. This can be easily proved by the same argument as in the
proof of Theorem 39. □